EXECUTIVE SUMMARY

This report documents the results of an evaluation of the CARE III and
ARIES 82 reliability tools for application to advanced fault-tolerant aerospace
systems. The results of this investigation are expected to provide guidance for
planning future reliability modeling research and development.

To determine reliability modeling requirements, the evaluation focuses on the
Charles Stark Draper Laboratories' Advanced Information Processing System
(AIPS) architecture as an example architecture for fault tolerant aerospace
systems. A number of simple reliability problems were formulated and analyzed
using CARE III and ARIES 82. From these test problems and from the reliability
modeling requirements of AIPS, advantages and limitations were identified for
CARE III and ARIES 82.

CARE III, which was designed primarily for analyzing ultrareliable flight
control systems, was found to have many desirable features. Among these were
the capability of handling large systems, a somewhat flexible fault handling
model, nonconstant failure rates in the fault occurrence model, the provision for
near coincident double faults, the computational accuracy required for analyzing
ultrareliable systems, and a user interface which, although not fully interactive,
provides for simple and flexible system definition.

Examination of the reliability modeling requirements for the AIPS architecture,
particularly for the long mission times in space applications, revealed several
current limitations of CARE III. System scenarios which were difficult to model
or could not be modeled with CARE III were

• Systems with unpowered spare modules, 

• Systems where equipment maintenance must be considered, 

• Systems where failure depends on the sequence in which faults oc- 
curred, 1 and 

• Systems where multiple faults greater than a double near coincident 
fault must be considered. 2 


Also, the computational accuracy of CARE III is limited outside the ultrareliable
regime.

The ARIES 82 program, whose primary use has been to support university
research and teaching, was found to have a number of desirable features.
Among these were the interactive nature of the program, the ability to handle a
wide range of system scenarios such as systems with and without maintenance
and systems with powered or unpowered spares, the flexibility of user-defined
state transition matrices, and the computation of performance measures other
than reliability such as mean time to failure, life cycle measures, and improvement
factors. The primary limitations identified for ARIES were

• The use of instantaneous coverage,

• The use of constant transition rates,

• The limitations on the size of systems that can be modeled,

• Lack of formal validation,

• Several programming errors, which were apparent from analyzing
sample problems,

• Limited computational accuracy, especially for the ultrareliability
requirements of commercial air transport, and

• The fact that ARIES is an unsupported product.

Neither CARE III nor ARIES was suited to determining the reliability of
complex nodal networks of the type used to interconnect processing sites in the
AIPS architecture. In fact, this particular reliability analysis problem is not
addressed by existing modeling tools and will require the development of new
techniques.

It was concluded that ARIES was not suitable for modeling advanced fault
tolerant systems. It was further concluded that, subject to the limitations cited
above, CARE III is best suited for evaluating the reliability of advanced fault
tolerant systems for air transport.


1 Appendix B.2 of the CARE III User's Guide describes a method to analyze time sequence dependent
faults. This method essentially establishes bounds for reliability. Under appropriate conditions, usually
mission times which are short relative to the time between failures, these bounds should be close to the actual
reliability. For longer mission times, such as space applications, where exhaustion of components and
techniques such as function migration are factors, it is not clear that the suggested method will be sufficiently
accurate.

2 The need to consider near coincident faults of order greater than two arises from configurations such
as the quintuplex. With short fault recovery intervals and improved component reliability, triplex
configurations may meet the needs of future systems, and hence the need to analyze higher order, near coincident
faults would not arise. It should be noted, however, that short fault recovery intervals may be difficult to
achieve, particularly with respect to software components. It should be further noted that CARE III's inability
to handle third order, near coincident faults does not arise because it fails to evaluate the failure probability
due to critical triples. The probabilities of critical triples are often quite small. The limitation arises because
CARE III cannot exclude from the reliability calculation those near coincident double faults which would not
lead to system failure in systems such as the quintuplex.
1.0 Introduction and Scope


Digital flight control systems for spacecraft and aircraft perform life- or mission-
critical functions. Extremely high reliability requirements must be established and
demonstrated for these systems. To meet the reliability requirements, systems become
large and complex. Size, complexity, and demanding requirements combine to make the
prediction and validation of reliability difficult.

During the system design phase, reliability predictions must be obtained to support 
design tradeoffs between potential system architectures. After such a fault tolerant sys- 
tem has been built, experimental techniques for establishing reliability, such as life test- 
ing and simulation, are often precluded or are of limited value due to high costs. Con- 
sequently, sophisticated reliability modeling tools based on analytic models are needed to 
predict and validate reliability for both the design and development phases of fault 
tolerant systems. 

This report details the results of an evaluation of CARE III and ARIES 82, two
reliability modeling tools for application to fault tolerant system architectures. The
evaluation was performed under NASA Contract NAS1-16489.

CARE III (Computer Aided Reliability Estimation) is the latest in a series of
reliability assessment tools co-developed by NASA-LaRC and Raytheon. It was primarily
designed for analyzing ultrareliable flight control systems. ARIES 82 (Automated
Reliability Interactive Estimation System) is based on a unified model for reliability
estimation developed by Ng and Avizienis at the University of California, Los Angeles. Its
primary use has been to support university research and teaching.

The objective of this evaluation was to perform a comparative analysis and
assessment of CARE III and ARIES 82 for application to advanced fault tolerant flight control
systems such as the Advanced Information Processing System (AIPS) being developed by
Charles Stark Draper Laboratories. Specifically, the following tasks were performed:

1. The AIPS architecture information was obtained and reviewed. The
suitability of CARE III and ARIES 82 for AIPS analysis was
determined.

2. A comparative analysis of CARE III and ARIES 82 was carried
out.

3. CARE III and ARIES 82 were applied to problems of varying
complexity.

4. The limitations of CARE III and ARIES 82, with respect to
application to advanced fault tolerant architectures, were determined.

The fault tolerant features of the AIPS architecture are reviewed in Section 2.0 of
this report. Section 3.0 provides an overview of the CARE III and ARIES 82 fault
models. In Section 4.0, test cases that were analyzed using CARE III and ARIES 82 are
described and the results are given. In Section 5.0, CARE III and ARIES 82 are
compared and the limitations of each are identified.
2.0 Advanced Information Processing System (AIPS)

2.1 Objectives and Requirements [1] 

The Advanced Information Processing System (AIPS) is a fault and damage tolerant 
system architecture which satisfies real-time data processing requirements for aerospace 
applications. The specific requirements for seven aerospace applications were esta- 
blished by Draper Laboratories and are given in Figure 2.1. As can be seen, a wide 
range in each resource requirement or performance parameter is covered by these appli- 
cations. 

Attributes of the AIPS architecture are 

• Growth and Change Tolerance, 

• Accepts Technology Upgrades, 

• Graceful Degradation, 

• System Complexity is Transparent to the User, 

• Graded Redundancy, and 

• Damage Tolerance. 

2.2 AIPS Architectural Features and Building Blocks [1] 

The elements of the AIPS architecture are the Fault Tolerant Multiprocessor
(FTMP), the Fault Tolerant Processor (FTP), a fault and damage tolerant Intercomputer
Network (IC), a fault and damage tolerant Input/Output Network (I/O), a fault tolerant
mass memory, a fault tolerant power distribution system, and a network operating
system which allows the elements to operate together.

Figure 2.1. AIPS Application Requirements

Figure 2.2 shows the proof-of-concept model of the AIPS architecture. AIPS
consists of processing sites, either FTMP or FTP, which are distributed as necessary
throughout the vehicle. They are linked by a layered damage and fault tolerant IC
network. Input/Output buses provide access to Input/Output devices. Processing sites and
I/O buses may have a global, regional, or local extent. For example, most or all
processing sites would have access to I/O devices that are connected via a global I/O bus,
e.g., an I/O bus that is connected to each processing site. A local I/O bus could connect
I/O devices to one processing site. Similarly, software operating systems for AIPS can
have global, regional, or local control. Access to a fault tolerant mass memory is
provided via a dedicated mass memory bus.

Resources within the distributed system are usually assigned to a fixed set of func- 
tions. Under certain conditions, such as a change in mission phase or a hardware 
failure, the computing resources can be reassigned to other functions. This capability 
allows for limited distributed processing and is called semi-dynamic function migration. 
Function migration is expected to be used to reconfigure system resources in order to 
achieve higher reliability for critical functions or to meet the resource or power require- 
ment due to changes in a mission phase. 

Hardware redundancy is implemented at the processor, memory, and bus level.
Redundancy provides for fault detection and for continued operation of the system
following a component failure. Redundant elements are operated in tight synchronism,
resulting in improved fault coverage and latency. Fault detection and masking functions
are implemented in hardware. Tight synchronism requires that these functions be
invoked frequently. By implementing these functions directly in hardware, the need for
additional computational resources required by a software implementation is avoided.
The less frequently invoked fault isolation and reconfiguration functions are
implemented in software.

The successful distribution of data from a simplex source to redundant processors is 
necessary to avoid single point failures. The processors must exchange their copies of 
the simplex data to assure that the same data values are being used by each processor. 
The process of establishing source congruency is supported and made efficient in the 
AIPS architecture by use of software and special hardware features. 

A triplex FTP architecture is shown in Figure 2.3. The FTP can be configured in
simplex, duplex, or triplex processor form. Each FTP channel has an Input/Output
processor (IOP) and a computational processor (CP). These processors have separate
memories, clocks, and timers. The IOP has interfaces to the I/O and IC buses. The
processors have access to a shared memory, interfaces to the mass memory, and to data
exchange hardware. The data exchange hardware is used to exchange data between
redundant channels, to detect faults, and to mask faults. Redundant channels are
tightly synchronized using a fault tolerant clock.

The IOP interfaces to a redundant IC network. It receives from each layer of the
IC network and detects and masks faults. However, it can only transmit on one layer of
the network. The other layers in the network are reserved for the remaining redundant
channels in the FTP. With respect to a single channel, the receive interface is cross-
strapped and the transmit interface is not.

Figure 2.3. Fault Tolerant Processor (FTP) Architecture

The FTMP shown in Figure 2.4 is composed of a number of computational
processors (processors with local memory) all interconnected via a redundant, fully cross-
strapped multiprocessor bus. A shared memory can be accessed via the multiprocessor
bus. FTMP configurations could consist of triads of CPs interfaced to the I/O bus, to
the IC network, or the mass memory bus. Some triads could be connected only via the
multiprocessor bus.

The FTMP fault tolerance features such as synchronism, clocking, and redundancy 
are similar to those of FTP. 

The intercomputer network (IC) consists of three identical, independent layers.
Each layer consists of a number of multiported, circuit switched nodes interconnected by
communication links. Nodes are generally associated with specific processing sites.
Communication between any two processing sites can be established by selecting a
suitable combination of nodes and links. If a link fails, communications between two sites
can be reestablished by using another combination of nodes and links.

The I/O network is similar to the IC network except that only one layer is
implemented.

In summary, some of the key fault tolerant features are

• FTMP and FTP Concepts,

• Hardware Redundancy,

• Redundant Elements in Tight Synchronism,

• Fault detection and masking implemented in hardware,

• Fault isolation and reconfiguration implemented in software,

• A layered nodal intercomputer communications network with
reconfiguration features,

• A nodal I/O communications network with reconfiguration features,

• Features to support and efficiently implement the process to
establish source congruency, and

• Function Migration.

Figure 2.4. Fault Tolerant Multiprocessor Architecture


2.3 AIPS Requirements and Features Impacting Reliability Assessment [2] 

A number of AIPS requirements and architectural features impact reliability assess- 
ment. Among those the following are of particular significance for the purpose of this 
report. 


1. Resource requirements are, for some applications, large and hence 
the number of high level components (processors, memories, etc.) 
can become large. 

2. A high degree of fault tolerance is required resulting in the need to 
account for failure of fault handling as well as the exhaustion of 
components. 

3. Applications require both short and very long mission times. 

4. For some applications, the architecture results in large nodal net- 
works. 

5. The Intercomputer Network is partially cross-strapped. 

6. The function migration feature required for some applications can 
complicate reliability analysis. 

Some applications for AIPS will permit system maintenance and repair (open
system), others will not (closed system). Certain space missions will require the use of
unpowered spare system modules, and hence the capability to model different failure
rates for powered and unpowered components will be needed.

The use of function migration will impact reliability analysis in a number of ways.
For example, loss of the system function will depend on whether function migration can
be completed. This could depend upon whether a particular fault occurs before or after
the need for function migration. Consequently, loss of system function will depend on
the order or sequence of fault occurrence.

The long mission times could impact the accuracy of reliability estimates made using 
numerical approximations. 

Partial cross-strapping of the IC networks dictates that processing sites and the IC 
network cannot be analyzed independently (structurally decomposed). 

The large nodal communication networks required impact the reliability analyses.
Figure 2.5 shows a simple nodal communications network between a triple redundant set
of sensors, processors, and actuators. This network is connected in a planar topology.
It can be determined by observation that the loss of two links can isolate a node and that
the loss of three links will lead to system failure. Determining the number of failure
combinations that lead to loss of system is slightly more difficult but is not too
demanding. However, the more complex network given in Figure 2.6 is much more difficult to
analyze. In applications using the AIPS architecture, network failures can be a major
factor determining system unreliability. Consequently, the capability to analyze complex
nodal networks will be necessary for some AIPS applications. Further, this capability
could be used to develop better network topologies such as the alternate network shown
in Figure 2.5. With this alternate non-planar topology, three failures are required to
isolate a node and six failures are required for loss of system. In such a case, network
reliability would be sufficiently high on short missions that system reliability would not
be affected.
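
The kind of link-failure counting described above can be sketched by brute force for small networks. The following Python sketch does not reproduce the networks of Figures 2.5 and 2.6. It uses two hypothetical six-node topologies (a ring, and the same ring with three added cross links) and treats "the network splits into two or more pieces" as the failure criterion, which is an assumption and not the report's definition of loss of system. All function and variable names are illustrative.

```python
from itertools import combinations


def is_connected(nodes, links):
    """Return True if every node can reach every other node over the links."""
    adjacency = {n: set() for n in nodes}
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    start = nodes[0]
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for neighbor in adjacency[current]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return len(seen) == len(nodes)


def count_disconnecting_sets(nodes, links, k):
    """Count the sets of exactly k failed links that split the network."""
    count = 0
    for failed in combinations(links, k):
        surviving = [link for link in links if link not in failed]
        if not is_connected(nodes, surviving):
            count += 1
    return count


def min_links_to_disconnect(nodes, links):
    """Smallest number of simultaneous link failures that splits the network."""
    for k in range(1, len(links) + 1):
        if count_disconnecting_sets(nodes, links, k) > 0:
            return k
    return None


# Hypothetical topologies (not taken from the report's figures).
NODES = list(range(6))
RING = [(i, (i + 1) % 6) for i in range(6)]      # each node has two links
CROSS_LINKS = [(i, i + 3) for i in range(3)]     # add links across the ring
MESH = RING + CROSS_LINKS                        # each node has three links

if __name__ == "__main__":
    for name, links in (("ring", RING), ("ring + cross links", MESH)):
        k = min_links_to_disconnect(NODES, links)
        print(name, k, count_disconnecting_sets(NODES, links, k))
```

For the hypothetical ring, two link failures suffice to split the network; with the cross links added, three are needed. The search is exhaustive, so its cost grows combinatorially with the number of links, which is consistent with the remark that larger networks are much more difficult to analyze.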
3.0 OVERVIEW OF ARIES AND CARE III

3.1 Introduction [3] 

Because of the ultrareliability requirements of the AIPS architecture, an analytic
method of assessing reliability is required. This method must be sufficiently general to
cover the wide range of systems that can be developed with AIPS. It must also be
computationally feasible. One widely used method is to model the system as a finite-state,
continuous-parameter Markov process $X(t)$, $t \ge 0$. In this model, the state probabilities
are defined as $p_j(t) = P[X(t)=j]$, the probability that the system is in state j at time t;
the transition probabilities as $p_{ij}(t,t+h) = P[X(t+h)=j \mid X(t)=i]$, the probability that
the system is in state j at time t+h given that it was in state i at time t; and the
transition rates $q_j(t)$ and $q_{ij}(t)$ as

$$ q_{ij}(t) = \left.\frac{\partial}{\partial h}\, p_{ij}(t,t+h)\right|_{h=0} , \quad i \ne j $$

and

$$ q_j(t) = -\left.\frac{\partial}{\partial h}\, p_{jj}(t,t+h)\right|_{h=0} = \sum_{i \ne j} q_{ji}(t) . $$

The system's state probabilities can then be found by solving the matrix equation

$$ P'(t) = Q(t)\,P(t) , $$

where $P(t) = (p_1(t), p_2(t), \ldots, p_n(t))$ is the state probability vector for the system's n
operational states and $Q(t)$ is the transition rate matrix. The reliability of the system at
time t is then given by

$$ R(t) = \sum_{i=1}^{n} p_i(t) . $$
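
As an illustration of the matrix equation and the reliability sum, the following Python sketch integrates $P'(t) = Q(t)P(t)$ numerically with a classical fourth-order Runge-Kutta scheme. It is not the solution method of either tool. The example system (two identical units, either of which suffices, with a hypothetical constant failure rate `LAM`) and all names are assumptions made for illustration, and the matrix is arranged so that entry `Q[i][j]` is the rate of flow from state j into state i, with $P$ treated as a column vector.

```python
import math


def rk4_step(q_of_t, p, t, h):
    """Advance the state probability vector p from time t to t + h."""
    def derivative(time, state):
        q = q_of_t(time)
        n = len(state)
        return [sum(q[i][j] * state[j] for j in range(n)) for i in range(n)]

    k1 = derivative(t, p)
    k2 = derivative(t + h / 2, [x + h / 2 * k for x, k in zip(p, k1)])
    k3 = derivative(t + h / 2, [x + h / 2 * k for x, k in zip(p, k2)])
    k4 = derivative(t + h, [x + h * k for x, k in zip(p, k3)])
    return [x + h / 6 * (a + 2 * b + 2 * c + d)
            for x, a, b, c, d in zip(p, k1, k2, k3, k4)]


def reliability(q_of_t, p0, t_end, steps=1000):
    """R(t_end): the sum of the operational-state probabilities."""
    h = t_end / steps
    p = list(p0)
    for step in range(steps):
        p = rk4_step(q_of_t, p, step * h, h)
    return sum(p)


LAM = 1e-3  # hypothetical failure rate per hour for each unit


def duplex_q(t):
    """Operational states: 0 = both units good, 1 = one unit good.

    The rates happen to be constant here, but the solver accepts any Q(t).
    """
    return [[-2 * LAM, 0.0],
            [2 * LAM, -LAM]]


if __name__ == "__main__":
    t_end = 1000.0
    numeric = reliability(duplex_q, [1.0, 0.0], t_end)
    exact = 2 * math.exp(-LAM * t_end) - math.exp(-2 * LAM * t_end)
    print(numeric, exact)
```

Because `q_of_t` is a function of time, the same sketch accepts nonconstant transition rates; the constant-rate example is used only because its closed-form reliability is available for comparison.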

Both ARIES and CARE III use this Markovian model; however, they differ in their
definition of states and transition probabilities. In ARIES, the Markov process is
assumed to be time-homogeneous; i.e., the transition probabilities $p_{ij}(t,t+h)$ depend not
on the initial time t but on the elapsed time h. As a result of this assumption, the states
of the model must have exponentially distributed holding times. Fault-occurrence states
are differentiated according to configuration so that a state reconfigured with spares is
different from a state with the same number of active modules but in which an
uncovered spare failure has occurred. This distinction is made because the system can
degrade from the former but not from the latter state. There are no fault-handling
states: coverage is assumed to be instantaneous and is incorporated into the transition
rates as a constant probability.
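
As a small illustration of how instantaneous coverage can be folded into constant transition rates (a generic sketch under stated assumptions, not necessarily the exact ARIES formulation), consider a hypothetical state with $y$ active units, each failing at a constant rate $\lambda$, and a constant coverage probability $c$:

$$ q_{\mathrm{covered}} = y\lambda c , \qquad q_{\mathrm{uncovered}} = y\lambda(1-c) , \qquad q_{\mathrm{covered}} + q_{\mathrm{uncovered}} = y\lambda . $$

Both rates are constant, so the holding time in the state remains exponentially distributed with rate $y\lambda$ and no separate fault-handling state is needed.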

In CARE III, time-homogeneity is not required and non-exponentially distributed
holding times are allowed. The fault-occurrence states are defined only by the number
of operational active and spare modules: no distinction is made as in ARIES between
degradable and nondegradable configurations. However, the transition rates are
formulated so that the state probabilities are the same as they are in ARIES. [4] Coverage is
modeled in CARE III by fault-handling states, which represent the detection, isolation,
and recovery from errors, and failure states, which are entered because of coverage
failures. Both of these reliability tools are discussed in the following sections.

3.2 ARIES [5] [6] [7]

3.2.1 General Description 

ARIES is an interactive, unified reliability modeling tool developed by Ng and
Avizienis at UCLA. The current version, ARIES 82, is written in C for use on UNIX
systems and is intended primarily as a teaching aid in the evaluation-based design of FT
computers. [6]

In ARIES, a system is defined to be a series configuration of homogeneous
subsystems, each of which can be modeled as a finite-state, continuous-parameter, time-
homogeneous Markov process. State aggregation is achieved through this structural
decomposition since, rather than considering the system as a whole, each subsystem is
analyzed separately and the results combined to give system reliability. State reduction
is also achieved by approximating fault-handling states through instantaneous coverage.

In ARIES there are six basic models defining closed, repairable, and renewable sys- 
tems as follows: 

Type 1 Closed FT System with Permanent Faults, 

Type 2 Closed FT System with Transient Fault Recovery, 

Type 3 Mission-Oriented Repairable System, 

Type 4 Repairable System with Transient Fault Recovery, 

Type 5 Repairable System with Restart, and

Type 6 Periodically Renewed Closed FT System.

The Type 1 system is a closed fault tolerant system. It does not undergo any
external repair or renewal and all faults that occur are assumed to be permanent faults. The
system can have powered or unpowered spares and can degrade after the spares are
exhausted. However, the system's ability to degrade can be blocked by unrecoverable
spare failures, since it is assumed that if an undetected and unrecoverable failure exists
in a spare, the system cannot activate succeeding spares and will fail when that spare is
switched in. It is also assumed that spares are periodically tested, that spare selection is
predetermined, and that a failed module is removed from the system.

The model for the closed FT system (Type 1) is shown in Figure 3.1. The states in
this model correspond to triples of the form (y,s,d), where

y = the number of fault-free active units,
s = the number of available spares, and
d = the number of degradations allowed

and blocked spare states (y,s,d), where

y = the beginning number of active units,
s = the number of accessible spares, and
d = 0.

The ordinary (y,s,d) states represent reconfigurations of the system as active modules
fail and are replaced by spares until all spares are exhausted and the system degrades,
terminating in one of two final states (safe shutdown or system failure). The blocked
spare (y,s,d) states represent reconfigurations of the subsystem that cannot be degraded
because an undetected and unrecoverable error exists in a spare and will cause system
failure when that spare is switched in. There are no states to represent fault handling:
these states are approximated by coverage probabilities associated with the transitions
between the fault occurrence states.


This model is instantiated by assigning values for the parameters D, S, CS,
$\lambda$, $\mu$, Y, and CY. The parameter D is the number of degradations the system can
sustain, i.e., the number of active units that can be lost without replacement; S is the
number of spares. CS is the coverage associated with each spare; if CS < 1, then the blocked
spare states, (y,s,d), of the model can be entered. $\lambda$ and $\mu$ are the failure rates for the
active and spare units, respectively, and are assumed to be constant. If $\mu = \lambda$, the
spares are assumed to be powered; if $\mu < \lambda$, they are assumed to be unpowered.
Although unpowered spares are allowed, $\mu$ must be greater than zero and $\lambda/\mu$ must be
no greater than $10^6$. The number of active modules in each degraded configuration is
entered as a vector Y = (A, A-1, ..., A-D, A-(D+1)), where Y[0] is the initial
number of active units, Y[i] is the number after the i-th degradation, and Y[D+1] is the
number in the safe shutdown state. The coverage probabilities associated with the
transitions between configurations are entered as a vector
$CY = (C_A, C_{A-1}, \ldots, C_{A-D}, C_{A-(D+1)})$, where CY[0] is the coverage probability used
for all transitions while any spares remain and CY[i] is that used for the transition to the
i-th degradation. If there are no spares in the system, CY[0] is never used.
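
The parameter constraints stated above can be collected into a small input check. The following Python sketch is illustrative only: the class name, field names, and numerical values are hypothetical and do not reflect the ARIES input syntax. It encodes just the constraints given in the text (vector lengths of D+2, $\mu > 0$, $\lambda/\mu \le 10^6$) plus the assumption that coverages are probabilities.

```python
from dataclasses import dataclass
from typing import List

MAX_RATE_RATIO = 1e6  # limit on lambda/mu quoted in the text


@dataclass
class Type1Parameters:
    """Hypothetical container for the Type 1 model inputs named in the text."""
    D: int            # number of degradations allowed
    S: int            # number of spares
    CS: float         # coverage associated with each spare
    lam: float        # failure rate of one active unit (lambda)
    mu: float         # failure rate of one spare unit
    Y: List[int]      # active units per configuration, Y[0]..Y[D+1]
    CY: List[float]   # coverage per transition, CY[0]..CY[D+1]

    def problems(self) -> List[str]:
        """Return a list of violated constraints (empty if none)."""
        found = []
        if len(self.Y) != self.D + 2:
            found.append("Y must have D + 2 entries")
        if len(self.CY) != self.D + 2:
            found.append("CY must have D + 2 entries")
        if not 0.0 <= self.CS <= 1.0:
            found.append("CS must be a probability")
        if any(not 0.0 <= c <= 1.0 for c in self.CY):
            found.append("CY entries must be probabilities")
        if self.mu <= 0.0:
            found.append("mu must be greater than zero")
        elif self.lam / self.mu > MAX_RATE_RATIO:
            found.append("lambda/mu must not exceed 1e6")
        return found

    def spares_powered(self) -> bool:
        """Spares are treated as powered when mu equals lambda."""
        return self.mu == self.lam

    def blocked_states_reachable(self) -> bool:
        """Blocked spare states can be entered only if CS < 1."""
        return self.CS < 1.0


if __name__ == "__main__":
    # Hypothetical triad with two unpowered spares and one degradation.
    params = Type1Parameters(D=1, S=2, CS=0.99, lam=1e-4, mu=1e-5,
                             Y=[3, 2, 1], CY=[0.999, 0.99, 0.95])
    print(params.problems(), params.spares_powered())
```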

Each of the six models has an identifying set of parameters. These parameters 
specify configuration, failure modes, and coverage mechanisms for each system type. 
For a complete list of the ARIES parameters, see Figure 3.2. 


Systems that do not conform to the assumptions for Types 1 - 6 or cannot be
decomposed into subsystems of these types cannot be accurately described by those
models. However, any system that can be represented by a single state transition-rate
matrix can be solved by ARIES. For these Type 7 systems, the user enters the complete
system transition-rate matrix rather than specifying values for model parameters. This
user-specified matrix is then incorporated into the solution of the system in the same
manner as the matrix that is generated from the fixed ARIES models.

Y[0]       = Initial number of active modules
S          = Initial number of spare modules
D          = Number of degradations allowed in the active set
Y          = Active resource vector (Y[0],...,Y[D],Y[D+1])
Z          = Computing capacity vector (Z[0],...,Z[D],0)
$\lambda$  = Failure rate of one active module
$\mu$      = Failure rate of one spare module
u          = Failure rate of one good module in safe shutdown condition
t          = Transient fault arrival rate of one active module
D          = Mean duration of a transient fault
CS         = Coverage for recovery from spare failures
CY[i]      = Coverage associated with the transition to
             the degraded configuration specified by Y[i]
CY         = Coverage vector for active failures
           = (CY[0],...,CY[D],CY[D+1])
NP         = Number of recovery phases for transient faults
CR         = Recoverability from transient faults
X          = Interference rate for transient faults
           = The failure rate of all hardware involved
             in executing the transient recovery processes
T[i]       = The duration of the ith recovery phase
             for transient faults
T          = Recovery duration vector for transient faults
           = (T[1],...,T[NP])
CE[i]      = The effectiveness of the ith recovery phase
             for transient faults
CE         = Recovery effectiveness vector for transient faults
           = (CE[1],...,CE[NP])

Figure 3.2. ARIES Parameters


3.2.2 Solution Method 

As a result of the time-homogeneous restriction in ARIES, the transition-rate
matrix Q(t) simplifies to a constant matrix Q with elements

$$ \begin{cases} q_{ij} , & i \ne j \\ -q_j , & i = j \end{cases} $$

and the matrix equation simplifies to

$$ P'(t) = Q\,P(t) . $$

Thus,

$$ P(t) = e^{Qt}\,P(0) . $$

This system is solved in ARIES as

$$ P(t) = \sum_{i=1}^{n} e^{\alpha_i t} \left[ \prod_{\substack{j=1 \\ j \ne i}}^{n} \frac{Q - \alpha_j I}{\alpha_i - \alpha_j} \right] P(0) , $$

where $\alpha_i$ is an eigenvalue of Q. [3] The solution's use of Sylvester's theorem to evaluate
$e^{Qt}$ requires that the transition-rate matrix have distinct eigenvalues. This requirement
restricts the ratio of the active unit failure rate ($\lambda$) to the spare unit failure rate ($\mu$) in a
system with unpowered spares to $\lambda/\mu \le 10^6$, $0 < \mu < \lambda$.
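
The Sylvester expansion can be checked on a small case. The following Python sketch is an illustration, not the ARIES code: it applies the formula to a hypothetical two-state model of a triad (three good units, then two good units) whose transition-rate matrix is triangular, so the eigenvalues can simply be read from the diagonal. The rate `LAM`, the matrix layout (entry `Q[i][j]` is the flow from state j into state i), and all names are assumptions.

```python
import math


def sylvester_state_probabilities(q, eigenvalues, p0, t):
    """Evaluate P(t) = sum_i exp(a_i t) * prod_{j != i} (Q - a_j I)/(a_i - a_j) * P(0)."""
    n = len(q)
    if len(eigenvalues) != n or len(set(eigenvalues)) != n:
        raise ValueError("Sylvester's formula needs n distinct eigenvalues")
    result = [0.0] * n
    for i, a_i in enumerate(eigenvalues):
        # Apply each factor (Q - a_j I)/(a_i - a_j) to the vector in turn;
        # the factors commute, so the order does not matter.
        vec = list(p0)
        for j, a_j in enumerate(eigenvalues):
            if j == i:
                continue
            q_vec = [sum(q[r][c] * vec[c] for c in range(n)) for r in range(n)]
            vec = [(q_vec[r] - a_j * vec[r]) / (a_i - a_j) for r in range(n)]
        weight = math.exp(a_i * t)
        result = [acc + weight * v for acc, v in zip(result, vec)]
    return result


LAM = 1e-3  # hypothetical module failure rate per hour

# Operational states: 0 = three good units, 1 = two good units.
Q = [[-3 * LAM, 0.0],
     [3 * LAM, -2 * LAM]]
ALPHAS = [Q[0][0], Q[1][1]]  # eigenvalues of a triangular matrix

if __name__ == "__main__":
    p = sylvester_state_probabilities(Q, ALPHAS, [1.0, 0.0], 1000.0)
    print(p, sum(p))
```

For this example the expansion reduces to $p_0(t) = e^{-3\lambda t}$ and $p_1(t) = 3(e^{-2\lambda t} - e^{-3\lambda t})$, so the reliability is $3e^{-2\lambda t} - 2e^{-3\lambda t}$. The division by $\alpha_i - \alpha_j$ makes the need for distinct eigenvalues explicit.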


To implement the solution, the transition-rate matrix, Q, is determined from either
the model parameters or the user-specified matrix. In some cases the
eigenvalues of Q are computed from the model parameters; otherwise, they are
computed by reducing Q to upper Hessenberg form and applying the QR algorithm. If
nondistinct eigenvalues occur, the duplicates are dropped from the computation.

Next, the probability polynomial coefficient matrix, B, is constructed from Q; the
distinct eigenvalues, $\alpha$, of Q; and the initial state probability distribution, P(0). The
initial probability distribution P(0) for closed and repairable systems and closed phases of a
PRC system is

$$ P(0) = (1, 0, \ldots, 0) ; $$

for the renewal phase of a PRC system,

$$ P(0) = (1, 0, \ldots, 0)\, e^{Q_1 t} , $$

where $Q_1$ is the transition rate matrix for the closed operation phase.

After B is constructed, P(t) is computed for each state k, from B and $\alpha$, as

$$ p_k(t) = \sum_i b_{ki}\, e^{\alpha_i t} . $$

Once the state probabilities are solved, the reliability of the subsystem is computed as

$$ R(t) = \sum_k p_k(t) ; $$

i.e., as the sum of the state probabilities of the constituent states.

With the reliability $R_i(t)$ of each of the n subsystems comprising the system thus
computed and with the assumption of serial configuration of subsystems, the system
reliability R(t) is computed as

$$ R(t) = \prod_{i=1}^{n} R_i(t) . $$
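
The series combination can be sketched in a few lines of Python. The function name and the subsystem reliabilities below are hypothetical values chosen for illustration only.

```python
import math


def system_reliability(subsystem_reliabilities):
    """Reliability of subsystems in series: the product of their reliabilities."""
    return math.prod(subsystem_reliabilities)


if __name__ == "__main__":
    # Hypothetical subsystem reliabilities at one mission time.
    print(system_reliability([0.999, 0.9995, 0.99]))
```

Because the product can never exceed its smallest factor, the least reliable subsystem bounds the reliability of the whole series system.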
3.2.3 Outputs


System reliability is reported for user-specified time intervals. For each time
interval, the reliability of the complete system and the reliability of each component
subsystem are reported. The report is displayed on the terminal screen but can also be written
to a log file, plotted on a SOLTEC281 plotter, or filtered to a UNIX plotting tool.

In addition to system reliability, ARIES can compute and display the mean time to
first system failure, the normalized percentage of failure of each component subsystem,
the reliability improvement factor and the mission time improvement factor of one
system over other systems, the system failure rate, and, for renewable systems, life-cycle
measures.
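
Some of these measures can be illustrated with a short Python sketch. The definitions used here are the commonly used ones and are assumptions, since the report does not state the formulas ARIES applies: mean time to failure as the integral of R(t), the reliability improvement factor as the ratio of failure probabilities at a given time, and the mission time improvement factor as the ratio of the times at which each system falls to a given reliability. The two example systems (a single module and a 2-out-of-3 triad with a perfect voter) and the rate `LAM` are hypothetical.

```python
import math

LAM = 1e-4  # hypothetical module failure rate per hour


def r_simplex(t):
    """Reliability of a single module with a constant failure rate."""
    return math.exp(-LAM * t)


def r_tmr(t):
    """Reliability of a 2-out-of-3 triad with a perfect voter."""
    r = math.exp(-LAM * t)
    return 3 * r**2 - 2 * r**3


def mttf(reliability, t_end, steps):
    """Trapezoidal estimate of the integral of R(t) from 0 to t_end."""
    h = t_end / steps
    total = 0.5 * (reliability(0.0) + reliability(t_end))
    for k in range(1, steps):
        total += reliability(k * h)
    return total * h


def mission_time(reliability, target, t_max=1e7):
    """Time at which a decreasing R(t) falls to `target`, found by bisection."""
    lo, hi = 0.0, t_max
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if reliability(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def reliability_improvement_factor(old, new, t):
    """Ratio of failure probabilities of the old and new systems at time t."""
    return (1.0 - old(t)) / (1.0 - new(t))


def mission_time_improvement_factor(old, new, target):
    """Ratio of the times at which each system falls to the target reliability."""
    return mission_time(new, target) / mission_time(old, target)


if __name__ == "__main__":
    print(mttf(r_simplex, 2e5, 200000), mttf(r_tmr, 2e5, 200000))
    print(reliability_improvement_factor(r_simplex, r_tmr, 100.0))
    print(mission_time_improvement_factor(r_simplex, r_tmr, 0.99))
```

In this example the triad has a shorter mean time to failure than the single module ($5/(6\lambda)$ against $1/\lambda$) even though it is far more reliable over short missions, which shows why measures other than mean time to failure are useful when comparing fault tolerant systems.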

3.3 CARE III

3.3.1 General Description [8] [9] [10]

In CARE III, a system is defined to be a configuration of stages, where each stage
is a group of identical modules. Stage failures are independent. Stages within a system
may be dependently coupled as described by a fault tree. A module occupies a distinct
state for each combination of its fault status (whether a fault has occurred or not), fault
category (mode of failure and associated occurrence rate), and coverage state (detection
and handling of the fault). Denoting module a in stage x by (x,a), the states occupied
by a are defined by the vector (d(x,a), i(x,a), c(x,a)), where

$$ d(x,a) = \begin{cases} 0 & \text{if } (x,a) \text{ is operational} \\ 1 & \text{if } (x,a) \text{ is faulty,} \end{cases} $$

i(x,a) = fault category, and
c(x,a) = coverage state.

The states of the system are then defined by the M-dimensional vector $(d, i, c)^T$, where

d = (d(1,1),...,d(x,n(x)),...,d(N,n(N))),

i = (i(1,1),...,i(x,n(x)),...,i(N,n(N))),

c = (c(1,1),...,c(x,n(x)),...,c(N,n(N))),

n(x) = number of modules in xth stage,

N = number of stages in system, and

$$ M = \sum_{x=1}^{N} n(x) . $$

To reduce the number of system states, aggregate states are constructed by
grouping states according to the number of faulty modules in a stage, the system fault tree,
the coverage structure (i.e., fault-handling states), and the critical pairs fault trees. This
reduced system is only semi-Markov; but, assuming a large difference between the rates
for the coverage states and those for the fault-occurrence states, it can be decomposed
into a semi-Markov coverage model and a non-homogeneous Markov reliability model.

3.3.1.1 Coverage Model 

Three types of faults are represented in CARE III: permanent, intermittent, and
transient. A permanent fault is any fault that persists until the device is repaired; an
intermittent fault is any fault that persists only part of the time due, for example, to a
loose connection or a poor bond; and a transient fault is any fault which is not caused
by a permanent defect, but nevertheless manifests a faulty behavior for some finite time
and then disappears. [11] The User's Guide defines an error as any condition in which a
module is incorrectly performing its function. Although the User's Guide is not explicit
about the endurance of an error, the CARE III fault models implicitly assume that an
error, once produced, cannot disappear.

The CARE III coverage model consists of two models and accommodates two types
of coverage failures: single fault and double fault. Both of these models are discussed
in the following sections.

3.3.1.1.1 The Single-Fault Model

A single fault coverage failure occurs when a fault in a module causes an error before 
the fault is detected and the module isolated. The single-fault model is shown in Figure 
3.3. 


Let

    $F$ = event of a fault at time t (any of the 3 fault types),
    $E$ = event of an error at time t, and
    $\bar{F}$, $\bar{E}$ = complement of $F$, $E$.

The states are

    $A = F\bar{E}$:          the fault persists but has not produced an error,
    $B = \bar{F}\bar{E}$:    an intermittent or transient fault has healed without producing an error,
    $A_E = FE$:              the fault persists and has produced an error,
    $B_E = \bar{F}E$:        the intermittent fault has healed but the error persists,
    $A_D$:                   the fault was detected in the active state,
    $B_D$:                   the fault was detected in the benign state,
    $D_{PA}$:                the fault was detected as permanent from $A_D$, and
    $D_{PB}$:                the fault was detected as permanent from $B_D$.


In CARE III's terminology, states $F\bar{E}$ ($A$) and $FE$ ($A_E$) are active latent states; $\bar{F}E$
($B_E$) is a benign latent state; $\bar{F}\bar{E}$ ($B$) is the benign state. This distinction is important since
CARE III assumes that co-existing latent faults in two distinct modules, either within a
stage or between stages (as specified by the user), constitute loss of system. These pairs
are referred to as critical pairs.

Figure 3.3. Single-Fault Model of CARE III

Within the single-fault model, the possible transitions and the corresponding transition
rates are

    Transition            Rate
    $A$ to $B$            $\alpha$
    $A$ to $A_D$          $\delta(t)$
    $A$ to $A_E$          $\rho(t)$
    $A_E$ to $B_E$        $\alpha$
    $A_E$ to $A_D$        $c\,\varepsilon(t)$
    $A_E$ to FAILURE      $(1-c)\,\varepsilon(t)$
    $B$ to $A$            $\beta$
    $B_E$ to $A_E$        $\beta$
    $B_E$ to $B_D$        $c\,\varepsilon(t)$
    $B_E$ to FAILURE      $(1-c)\,\varepsilon(t)$
    $A_D$ to $A$          instantaneous
    $A_D$ to $D_{PA}$     instantaneous
    $B_D$ to $B$          instantaneous
    $B_D$ to $D_{PB}$     instantaneous


The transition rates $\alpha$ and $\beta$ are constant rates; the functions $\delta(t)$, $\rho(t)$, and $\varepsilon(t)$ are
restricted to either exponential or uniform densities of the form

\[ \phi \exp(-\phi t), \quad t > 0, \]

or

\[ \phi, \quad 0 < t < \frac{1}{\phi}. \]

Assuming that $t$ and $\tau$ are measured from the last entry into A or E ($A_E$ or $B_E$), respectively,
then the single-fault coverage model is a semi-Markov process.

The transition parameters $\alpha$ and $\beta$ define the three fault types as follows:

    Permanent Fault when $\alpha = \beta = 0$,
    Intermittent Fault when $\alpha \neq 0$, $\beta \neq 0$, and
    Transient Fault when $\alpha \neq 0$, $\beta = 0$.

When a Transient Fault reaches the B state, CARE III reconfigures the system to its
status prior to the occurrence of the fault (i.e., it treats the system as if the fault had
never occurred).

In setting up the system model, the user has the option of defining five different
single-fault models (i.e., the user may select five different sets of model parameters $\alpha$, $\beta$,
$\delta$, $\rho$, $P_A$, $P_B$). In addition, the user may select a rate of entry (each with a Weibull
distribution) for each of the five single-fault models. Let $x_j$ denote the jth single-fault model
for stage x. Then the rate of entry into the single-fault coverage model is given by

\[ \lambda(t|x_j) = \lambda(x_j)\,\omega(x_j)\,t^{\omega(x_j)-1}. \]

These rates of entry may be different for each of the 70 possible stages accommodated
by CARE III.
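The Weibull rate of entry above can be illustrated with a short sketch. The function names and the sample values of $\lambda$ and $\omega$ are hypothetical and chosen only for illustration; the second function uses the fact that the integral of the rate over [0, t] is $\lambda t^{\omega}$.

```python
import math

def weibull_entry_rate(lam, omega, t):
    """Rate of entry lam * omega * t**(omega - 1) for one single-fault model (t > 0)."""
    return lam * omega * t ** (omega - 1.0)

def survival_probability(lam, omega, t):
    """exp(-integral of the rate over [0, t]); the integral equals lam * t**omega."""
    return math.exp(-lam * t ** omega)

# Hypothetical parameters: omega = 1 gives a constant rate, omega > 1 an
# increasing rate, and omega < 1 a decreasing rate.
for omega in (0.5, 1.0, 2.0):
    rate = weibull_entry_rate(1.0e-4, omega, 10.0)
    surv = survival_probability(1.0e-4, omega, 10.0)
    print(f"omega={omega}: rate={rate:.6e}, survival={surv:.9f}")
```

With $\omega = 1$ the rate reduces to the constant failure rate $\lambda$ used in the test cases of Section 4.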

CARE III then aggregates the single-fault models associated with each stage into
one single-fault model by OR-ing the A states, the B states, etc. of Figure 3.3. The
resultant aggregate model is non-homogeneous. The aggregation is illustrated in Figure
3.4. CARE III provides the additional option of allowing the user to select non-constant
transition rates for $\delta$, $\rho$ and $\varepsilon$, corresponding to uniformly distributed sojourn functions.

Figure 3.4. Combined Single Fault Model




In order to illustrate CARE III's technique for defining single-fault models, consider
a single-stage system that may experience two types of faults: permanent and
intermittent. The corresponding fault models are shown in Figure 3.5. For each fault type,
the user must define the parameters $\alpha$, $\beta$, $\varepsilon$, $\rho$, $P_B$. For the permanent fault,
$\alpha = \beta = 0$. For the intermittent fault, $\alpha \neq 0$, $\beta \neq 0$. The user must also define the
rate of occurrence associated with each fault type. This is done by selecting a pair of
Weibull parameters for each fault type. Figure 3.5 also illustrates the aggregation of the
two models into one single-fault model. For example, the aggregated transitions a(t), 
p(t) are derived from the two models as 



3.3.1.1.2 The Double-Fault Model 

A potential cause of loss of control is the occurrence of a fault in one component of a
redundant set in close time proximity with a previous, but independent, fault in a
different component. These combinations are near-coincident faults and are only
considered potentially catastrophic if both faults are simultaneously either active or
producing an error. CARE III accommodates near-coincident double faults by allowing the user
to designate which modules are vulnerable to double faults ("critical pairs" in CARE
III's terminology). The modules may be paired within a stage or across two stages.
CARE III, however, does not handle near-coincident triple, quadruple, etc. fault
combinations. The double-fault model is shown in Figure 3.6.

Figure 3.5. Corresponding CARE III Aggregated Single-Fault Model

Figure 3.6. Double-Fault Model of CARE III


The fault-handling procedure is as follows:

1. A single fault occurs. If a second fault occurs while the first
   fault is in one of the states $A$, $A_E$ or $B_E$, then CARE III
   assumes that this constitutes a system failure.

2. If, however, the first fault is in the B state upon the occurrence
   of the second fault, then state $A_2B_1$ of the double-fault model
   will be entered (A as the union of states $A$, $A_E$, $B_E$ of the
   single-fault model). If the detected state is entered, CARE III
   will configure out the faulty module.

It is important to note that the transitions of the double-fault model are completely
determined by those of the single-fault model. Effectively, CARE III assumes that the
processes which cause the transitions of the single-fault model are independent across
modules. The single- and double-fault models are incorporated in combination in the
CARE III stage representation as shown in Figure 3.7. It is assumed that near-coincident,
double, critical faults always result in loss of system, and no accommodation
is made for near-coincident triple or larger combinations of critical faults.

Figure 3.7. State Structure of a Stage as Represented by CARE III

















3.3.1.2 Reliability Model


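In the reliability model, the aggregate states are indexed by a set $\mathcal{L}$ of fault vectors
$\mathbf{l}$, where

\[ \mathbf{l} = (l(1), l(2), \ldots, l(N)), \]

l(x) = number of failed modules in the xth stage, and

\[ 0 \le l(x) \le n(x), \quad 1 \le x \le N. \]

The set $\mathcal{L}$ can be decomposed into two sets $L$ and $\bar{L}$ such that the system is operational
for $\mathbf{l} \in L$ and failed for $\mathbf{l} \in \bar{L}$, and $\mathcal{L} = L \cup \bar{L}$. The aggregate states can then be grouped
into the sets $H(\mathbf{l})$, $G(\mathbf{l})$, and $F(\mathbf{l})$ as follows:

for $\mathbf{l} \in \bar{L}$:

\[ H(\mathbf{l}) = \Big\{ (\mathbf{d},\mathbf{i},\mathbf{c}) : \sum_a d(x,a) = l(x),\ 1 \le x \le N \Big\}, \]

for $\mathbf{l} \in L$:

\[ G(\mathbf{l}) = \Big\{ (\mathbf{d},\mathbf{i},\mathbf{c}) : \sum_a d(x,a) = l(x),\ 1 \le x \le N, \text{ and } \mathbf{c} \text{ does not specify any coverage failures} \Big\}, \]

\[ F(\mathbf{l}) = \Big\{ (\mathbf{d},\mathbf{i},\mathbf{c}) : \sum_a d(x,a) = l(x),\ 1 \le x \le N, \text{ and } \mathbf{c} \text{ specifies at least one coverage failure} \Big\}. \]

$H(\mathbf{l})$ is the set of states in which the system has failed due to spares exhaustion;
$G(\mathbf{l})$, the states in which the system is operational; and $F(\mathbf{l})$, the states in which the
system has failed due to coverage failure. Given that the reliability of the system at t is

\[ R(t) = P(\text{system is in state } G(\mathbf{l}) \text{ at time } t,\ \mathbf{l} \in L) = P(X(t) = G(\mathbf{l})), \]

and letting $P(t|\mathbf{l})$ denote $P(X(t)=G(\mathbf{l}))$, $Q(t|\mathbf{l})$ denote $P(X(t)=F(\mathbf{l}))$, and $S(t|\mathbf{l})$ denote
$P(X(t)=H(\mathbf{l}))$, the reliability of the system is

\[ R(t) = \sum_{\mathbf{l} \in L} P(t|\mathbf{l}) = 1 - \sum_{\mathbf{l} \in L} Q(t|\mathbf{l}) - \sum_{\mathbf{l} \in \bar{L}} S(t|\mathbf{l}). \]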

3.3.2 Solution Method 

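Given that for a fault vector $\mathbf{l}$, $\mathbf{l}+\mathbf{1}(y)$ is $\mathbf{l}$ with one more fault in stage y, the possible
transitions between the aggregate states are

(a) $\mathbf{l} \in L$: $G(\mathbf{l})$ to $F(\mathbf{l})$,

(b) $\mathbf{l} \in L$ and $\mathbf{l}+\mathbf{1}(y) \in L$: $G(\mathbf{l})$ to $G(\mathbf{l}+\mathbf{1}(y))$,

(c) $\mathbf{l} \in L$ and $\mathbf{l}+\mathbf{1}(y) \in L$: $G(\mathbf{l})$ to $F(\mathbf{l}+\mathbf{1}(y))$,

(d) $\mathbf{l} \in L$ and $\mathbf{l}+\mathbf{1}(y) \in \bar{L}$: $G(\mathbf{l})$ to $H(\mathbf{l}+\mathbf{1}(y))$, and

(e) $\mathbf{l} \in \bar{L}$ and $\mathbf{l}+\mathbf{1}(y) \in \bar{L}$: $H(\mathbf{l})$ to $H(\mathbf{l}+\mathbf{1}(y))$.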

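Note that there are no transitions from $F(\mathbf{l})$ states since these states are absorbing.

Denoting these rates as $\mu(t|\mathbf{l})$ for (a); $\lambda^{(1)}(t|\mathbf{l}, \mathbf{l}+\mathbf{1}(y))$ for (b); $\lambda^{(2)}(t|\mathbf{l}, \mathbf{l}+\mathbf{1}(y))$ for
(c); and $\lambda'(t|\mathbf{l}, \mathbf{l}+\mathbf{1}(y))$ for (d) and (e), the forward differential equations for the system
are

\[ \frac{d}{dt}P(t|\mathbf{l}) = -P(t|\mathbf{l})\,\lambda(t|\mathbf{l}) + \sum_x P(t|\mathbf{l}-\mathbf{1}(x))\,\lambda^{(1)}(t|\mathbf{l}-\mathbf{1}(x), \mathbf{l}), \]

\[ \frac{d}{dt}Q(t|\mathbf{l}) = P(t|\mathbf{l})\,\mu(t|\mathbf{l}) + \sum_x P(t|\mathbf{l}-\mathbf{1}(x))\,\lambda^{(2)}(t|\mathbf{l}-\mathbf{1}(x), \mathbf{l}), \text{ and} \qquad (1) \]

\[ \frac{d}{dt}S(t|\mathbf{l}) = -S(t|\mathbf{l})\,\lambda^{*}(t|\mathbf{l}) + \sum_x \big[ P(t|\mathbf{l}-\mathbf{1}(x)) + S(t|\mathbf{l}-\mathbf{1}(x)) \big]\,\lambda'(t|\mathbf{l}-\mathbf{1}(x), \mathbf{l}), \]

where

\[ \lambda(t|\mathbf{l}) = \mu(t|\mathbf{l}) + \sum_x \lambda'(t|\mathbf{l}, \mathbf{l}+\mathbf{1}(x)) \]

and

\[ \lambda^{*}(t|\mathbf{l}) = \lambda(t|\mathbf{l}) - \mu(t|\mathbf{l}). \]

Considering the conditions governing transitions (b) and (c),

\[ \lambda^{(1)}(t|\mathbf{l}, \mathbf{l}+\mathbf{1}(y)) + \lambda^{(2)}(t|\mathbf{l}, \mathbf{l}+\mathbf{1}(y)) = \lambda'(t|\mathbf{l}, \mathbf{l}+\mathbf{1}(y)). \]

Furthermore, due to the high reliability of the systems modeled by CARE III,
$\lambda^{(1)}(t|\mathbf{l}, \mathbf{l}+\mathbf{1}(y))$ and $\lambda(t|\mathbf{l})$ must generally be much larger than $\lambda^{(2)}(t|\mathbf{l}, \mathbf{l}+\mathbf{1}(y))$ and
$\mu(t|\mathbf{l})$, respectively. Therefore,

\[ \lambda^{*}(t|\mathbf{l}) = \lambda(t|\mathbf{l}) - \mu(t|\mathbf{l}) \approx \lambda(t|\mathbf{l}) \]

and

\[ \lambda'(t|\mathbf{l}, \mathbf{l}+\mathbf{1}(y)) = \lambda^{(1)}(t|\mathbf{l}, \mathbf{l}+\mathbf{1}(y)) + \lambda^{(2)}(t|\mathbf{l}, \mathbf{l}+\mathbf{1}(y)) \approx \lambda^{(1)}(t|\mathbf{l}, \mathbf{l}+\mathbf{1}(y)). \]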

For computation of system unreliability, it is necessary to compute the $Q(t|\mathbf{l})$ occupancy
probabilities. If the transitions $\mu(t|\mathbf{l})$, $\lambda^{(1)}(t|\mathbf{l}, \mathbf{l}+\mathbf{1}(y))$, and $\lambda^{(2)}(t|\mathbf{l}, \mathbf{l}+\mathbf{1}(y))$ were
known, then the $P(t|\mathbf{l})$'s and, hence, the $Q(t|\mathbf{l})$'s could be solved for by simple quadratures.
Schematically, the aggregated states of the stage model are shown in Figure 3.8. Let

    $s_1, s_2, \ldots, s_n$ = states of the l-th operational model,
    $p_1, p_2, \ldots, p_n$ = corresponding occupancy probabilities, and
    $a_1, a_2, \ldots, a_n$ = corresponding exit transitions.

It can be shown that

\[ \dot{p}_1 + \dot{p}_2 + \cdots + \dot{p}_n = -\sum_i a_i p_i + G(t), \]

where G(t) is a linear combination of the occupancy probabilities of the previous (i.e.,
l-1) operational model. It is important to note that G is independent of the occupancy
probabilities of the l+1 operational model.

Figure 3.8. Schematic of Aggregated States of Stage Model

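Accordingly, the aggregated occupancy probability $P(t|\mathbf{l})$ is

\[ P(t|\mathbf{l}) = p_1 + p_2 + \cdots + p_n \]

and

\[ \frac{d}{dt}P(t|\mathbf{l}) = -P(t|\mathbf{l})\,\lambda(t|\mathbf{l}) + G, \qquad (4) \]

where

\[ \lambda(t|\mathbf{l}) = \frac{\sum_{i=1}^{n} a_i p_i}{\sum_{i=1}^{n} p_i}. \qquad (5) \]

The required transition probabilities are

\[ \lambda^{(1)}(t|\mathbf{l}, \mathbf{l}+\mathbf{1}(y)) = \frac{\sum_i a_i^{(1)} p_i}{\sum_i p_i}, \qquad (6) \]

\[ \lambda^{(2)}(t|\mathbf{l}, \mathbf{l}+\mathbf{1}(y)) = \frac{\sum_i a_i^{(2)} p_i}{\sum_i p_i}, \qquad (7) \]

and

\[ \mu(t|\mathbf{l}) = \lambda(t|\mathbf{l}) - \lambda^{(1)}(t|\mathbf{l}, \mathbf{l}+\mathbf{1}(y)) - \lambda^{(2)}(t|\mathbf{l}, \mathbf{l}+\mathbf{1}(y)), \qquad (8) \]

where $a_i^{(1)}$ and $a_i^{(2)}$ denote the parts of the exit transition $a_i$ from state $s_i$ that lead into
$G(\mathbf{l}+\mathbf{1}(y))$ and $F(\mathbf{l}+\mathbf{1}(y))$, respectively.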

Since the computation of $p_1(t), p_2(t), \ldots, p_n(t)$ would require the solution of the entire
model, alternate expressions for the occupancy probabilities are obtained by solving the
model in isolation from the rest of the system.

Assuming that the l-th operational model was entered at time $\tau$,³ then, if the initial
conditions are known, the occupancy probabilities of the internal states $s_1, s_2, \ldots, s_n$ can
be found. Let these probabilities be denoted by $\varphi_1(t-\tau), \varphi_2(t-\tau), \ldots, \varphi_n(t-\tau)$.
Then

\[ p_i(t) = \int_{\tau=0}^{t} \varphi_i(t-\tau)\, \mathrm{Prob}[\text{operational model was entered in } (\tau, \tau+d\tau)]. \qquad (9) \]

Although Prob[operational model was entered in $(\tau, \tau+d\tau)$] is not known, an approximate
value can be found by assuming that coverage is perfect; i.e., that system failure is
due entirely to exhaustion of components. Let $P^{*}_{l-1}(t)$ be the probability that the system
is operational after l-1 faults; then $P^{*}_{l-1}(t)$ can be obtained by combinatorial methods; i.e.,

\[ P^{*}(t|\mathbf{l}) = \prod_x \binom{n(x)}{l(x)} [1 - r(t|x)]^{l(x)}\, [r(t|x)]^{n(x)-l(x)}, \]

where $r(t|x) = \exp\!\big[-\int_0^t \sum_j \lambda(u|x_j)\,du\big]$ denotes the reliability of a module in stage x. Let
$\eta_l$ be the rate of the next fault. In general, $\eta_l$ will be a function of the fault rate,⁴ $\lambda$, of
each module and l; e.g., in a triplex voting system with spares, $\eta_l = 3\lambda$ where $\lambda$ = the
failure rate of a single module. Then,

\[ \mathrm{Prob}[\text{l-th model was entered in } (\tau, \tau+d\tau)] \approx P^{*}_{l-1}(\tau)\,\eta_l(\tau)\,d\tau. \qquad (10) \]

Note that $P^{*}_{l-1}(t)$ is the probability that the system has experienced exactly l-1 faults at
time t and has survived to time t.

³ t represents global time; $t - \tau$, local time.

⁴ This rate could be Weibull distributed, in which case $\eta_l$ is also a function of t.

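Accordingly, from (9),

\[ p_i(t) \approx \int_{\tau=0}^{t} \varphi_i(t-\tau)\, P^{*}_{l-1}(\tau)\, \eta_l\, d\tau \qquad (11) \]

and, from (7),

\[ \lambda^{(2)}(t|\mathbf{l}, \mathbf{l}+\mathbf{1}(y)) \approx \frac{\sum_i a_i^{(2)}(t) \int_{\tau=0}^{t} \varphi_i(t-\tau)\, P^{*}_{l-1}(\tau)\, \eta_l\, d\tau}{\sum_i \int_{\tau=0}^{t} \varphi_i(t-\tau)\, P^{*}_{l-1}(\tau)\, \eta_l\, d\tau}, \qquad (12) \]

where $a_i^{(2)}(t)$ is the exit transition from state $s_i$ into $F(\mathbf{l}+\mathbf{1}(y))$, and similarly for
$\lambda^{(1)}(t|\mathbf{l}, \mathbf{l}+\mathbf{1}(y))$.

Thus, to obtain the desired transitions,

(1) For each l, compute $\varphi_1(t-\tau), \varphi_2(t-\tau), \ldots$.

(2) Compute $P^{*}_{l-1}(t)\,\eta_l(t)$.

(3) Evaluate the integrals (11).

(4) Compute the transitions according to (12).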
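The four steps above can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not CARE III's implementation: the two-state operational model, the values of DELTA and ETA, the exit rates, and all function names are hypothetical, and the entry probability density is taken to be $\eta e^{-\eta\tau}$ (a single fault process of constant rate $\eta$).

```python
import math

def trapezoid(samples, h):
    """Composite trapezoid rule for equally spaced samples."""
    return h * (sum(samples) - 0.5 * (samples[0] + samples[-1]))

def occupancy(phi, entry_density, t, steps=20000):
    """Step (3): p_i(t) ~ integral over [0, t] of phi(t - tau) * entry_density(tau)."""
    h = t / steps
    samples = [phi(t - k * h) * entry_density(k * h) for k in range(steps + 1)]
    return trapezoid(samples, h)

def aggregate_rate(exit_rates, occupancies):
    """Step (4): occupancy-weighted average of the exit transitions."""
    return sum(a * p for a, p in zip(exit_rates, occupancies)) / sum(occupancies)

# Hypothetical model solved in isolation (step 1): a fault sits in an "active"
# state with probability exp(-DELTA*u) and is otherwise in a "handled" state.
DELTA = 50.0      # assumed handling rate, per hour
ETA = 3.0e-3      # assumed rate of the next fault, per hour
T = 1.0           # evaluation time, hours
phis = [lambda u: math.exp(-DELTA * u), lambda u: 1.0 - math.exp(-DELTA * u)]

def entry_density(tau):
    """Step (2): perfect-coverage probability times the rate of the next fault."""
    return ETA * math.exp(-ETA * tau)

occupancies = [occupancy(phi, entry_density, T) for phi in phis]
# Only the active state is given a nonzero (hypothetical) exit rate to failure.
rate = aggregate_rate([2.0e-4, 0.0], occupancies)
print(occupancies, rate)
```

For this simple choice the first integral has the closed form $\eta\,(e^{-\eta t} - e^{-\delta t})/(\delta - \eta)$, which gives a check on the quadrature.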

In actual practice, CARE III makes another approximation in the computation of
$\lambda^{(2)}(t|\mathbf{l}, \mathbf{l}+\mathbf{1}(y))$: since (11) can be rewritten as

\[ p_i(t) \approx \int_{x=0}^{t} \varphi_i(x)\, P^{*}_{l-1}(t-x)\, \eta_l(t-x)\, dx \]

and, in practice, $P^{*}_{l-1}(t)$ is a much more slowly varying function than $\varphi_i(x)$, $P^{*}_{l-1}$ can be
approximated by

\[ P^{*}_{l-1}(t-x) \approx a(t) + x\,b(t) + x^2 c(t) \qquad (13) \]

over the range of x in which $\varphi_i(x)$ is significantly different from zero.

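Substituting these transition rates, which are perfect coverage rates, in the equation
for $P(t|\mathbf{l})$, yields the equation

\[ \frac{d}{dt}P^{*}(t|\mathbf{l}) = -P^{*}(t|\mathbf{l})\,\lambda^{*}(t|\mathbf{l}) + \sum_x P^{*}(t|\mathbf{l}-\mathbf{1}(x))\,\lambda'(t|\mathbf{l}-\mathbf{1}(x), \mathbf{l}) \]

for the probability of l faults at time t, given perfect coverage.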

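Replacing $P(t|\mathbf{l})$ by $P^{*}(t|\mathbf{l})$ in the equations for $Q(t|\mathbf{l})$ and $S(t|\mathbf{l})$ results in

\[ Q(t|\mathbf{l}) = \int_0^t \Big[ P^{*}(u|\mathbf{l})\,\mu(u|\mathbf{l}) + \sum_x P^{*}(u|\mathbf{l}-\mathbf{1}(x))\,\lambda^{(2)}(u|\mathbf{l}-\mathbf{1}(x), \mathbf{l}) \Big]\, du \]

and

\[ S(t|\mathbf{l}) = P^{*}(t|\mathbf{l}). \]

Thus, CARE III can solve for the $Q(t|\mathbf{l})$'s without first solving for the $P(t|\mathbf{l})$'s and the
reliability of the system can be computed as

\[ R(t) = 1 - \sum_{\mathbf{l} \in L} Q(t|\mathbf{l}) - \sum_{\mathbf{l} \in \bar{L}} P^{*}(t|\mathbf{l}). \]

4.0 TEST CASES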


4.1 Introduction 

In order to evaluate how well CARE III and ARIES can be used to assess the reliability
of AIPS architectures, it was necessary to determine how useful and applicable
these tools are. Also, since each tool has inherent limitations, it was necessary to determine
how flexible each tool is with respect to accommodating systems that stress those
limitations. Thus, several sample systems were selected to demonstrate the use,
applicability, limitations, and relative accuracy of CARE III and ARIES. These test
cases do not test all of the features of each tool, nor do they attempt to verify the tools.
In particular, they do not test all of the ARIES system types nor all of the performance
measures that it computes. For CARE III, the impact of state aggregation in assessing
very large systems, the full use of the fault handling model, and non-constant failure
rates were not tested. These test case results coupled with the AIPS requirements serve
as a basis for a relative assessment of the two reliability modeling tools.

The test cases range in complexity from a single processor architecture to a system
suitable for flight control applications. Cases were selected to demonstrate relative
strengths and weaknesses of the two modeling tools. Also, it was absolutely essential
that accurate solutions could be obtained for the test cases. For each case a solution was
obtained based on standard analysis techniques and subject to assumptions appropriate
for the particular scenario and parameters. The solutions were calculated using simple
computer programs. Other than the use of double precision variables, no special
numerical techniques were employed to ensure the accuracy of the calculations.
Consequently, the accuracy is limited to that inherent in double precision floating point
arithmetic (64 bits) and in the numerical techniques used to compute the exponential
function. Under these conditions it was determined that computation of $e^{-x}$ for $x < 10^{-15}$
was subject to error. Since the probability of system failure on the order of $10^{-10}$ was of
interest, the accuracy under these limitations was judged to be adequate. The test cases
were then solved using CARE III and ARIES and the three results compared. Due to
limitations in either CARE III or ARIES, results from only one of these models could be
obtained for some cases. In addition, wherever feasible, the test cases were described as
both Type 1 and Type 7 ARIES systems and the results compared. It should be
noted that ARIES reliability results are normally reported to only seven significant
digits; to obtain results suitable for comparison, it was necessary to modify the ARIES
code to output 17.

The test cases, solutions, results, and difficulties encountered are discussed in the 
following sections. 

4.2 Simplex Processor 

A simplex processor was analyzed to point out any computational, as opposed to
modeling, differences. A constant failure rate of $\lambda$ was assumed. The probability of
failure (unreliability) for this system is given by:

\[ P(SF) = 1 - e^{-\lambda t}. \]

For small $\lambda t$,

\[ P(SF) \approx \lambda t. \]

The results for each method are summarized in the following table: 


    λt              Direct Calculation   ARIES 82*         CARE III
    0               0                    0                 0
    1.59 x 10^-23   -2.78 x 10^-17       0                 1.59 x 10^-23
    1.30 x 10^-19   -2.78 x 10^-17       0                 1.30 x 10^-19
    1 x 10^-16      6.94 x 10^-17        6.94 x 10^-17     9.99 x 10^-17
    5 x 10^-16      4.85 x 10^-16        4.85 x 10^-16     5.00 x 10^-16
    1 x 10^-15      9.99 x 10^-16        9.99 x 10^-16     1.00 x 10^-15
    1 x 10^-12      9.99 x 10^-13        9.99 x 10^-13     9.99 x 10^-13
    1 x 10^-10      9.99 x 10^-11        9.99 x 10^-11     1.00 x 10^-10
    1 x 10^-3       9.995 x 10^-4        9.995 x 10^-4     9.995 x 10^-4
    1               6.32 x 10^-1         6.32 x 10^-1      6.32 x 10^-1

* ARIES reports reliability. Unreliability was obtained by subtracting
ARIES reliability answers from 1.


Observe that for $\lambda t \ge 10^{-13}$ all methods give the same answer. Also, note that
CARE III continues to provide accurate results for much smaller values of $\lambda t$.

The ability of CARE III to provide accurate answers stems from computing unreliability
directly. Thus, computations involving very small differences between two
numbers which are close to unity are avoided. This, in turn, avoids approaching the
accuracy limitations imposed by finite arithmetic. In fact, CARE III uses single precision
arithmetic where the direct calculation method and ARIES use double precision
arithmetic.
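The effect described above can be reproduced in any floating point environment. The sketch below is illustrative only and is not taken from either tool; the function names are hypothetical. It compares forming the unreliability as 1 minus the reliability against evaluating $1 - e^{-x}$ directly with the standard library function `math.expm1`.

```python
import math

def unreliability_by_subtraction(x):
    """1 - exp(-x): subtracts two numbers close to unity when x is small."""
    return 1.0 - math.exp(-x)

def unreliability_direct(x):
    """-expm1(-x): evaluates 1 - exp(-x) without the cancelling subtraction."""
    return -math.expm1(-x)

for x in (1.0e-17, 1.0e-16, 1.0e-12, 1.0e-3, 1.0):
    print(f"{x:.0e}  {unreliability_by_subtraction(x):.6e}  {unreliability_direct(x):.6e}")
```

The specific wrong values produced by the subtraction depend on the arithmetic in use, so they need not match the table entries, which were obtained on the authors' machine.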


While unimportant for this simple case, this distinction between CARE III and
ARIES is important when analyzing more complex systems. Both accuracy and
computation resource requirements are issues.


4.3 TMR 


For the next case, a TMR system with no spares was chosen. The probability of
system failure ( P(SF) ) for this system is given by

\[ P(SF) = 1 + 2e^{-3\lambda t} - 3e^{-2\lambda t}. \]

The estimates of unreliability from the hand calculation, from CARE III, and from the
two ARIES types agreed closely and are summarized in the following table:

    t       Direct               ARIES 1              ARIES 7              CARE III
    0       0                    0                    0                    0
    .01     2.99996139042E-12    2.99996E-12          2.99996E-12          2.9999939165E-12
    .10     2.99995001063E-10    2.99995E-10          2.99995E-10          2.9999519535E-10
    1       2.999500043E-8       2.999500043E-8       2.999500043E-8       2.9994993156E-8
    5       7.4937529682E-7      7.4937529682E-7      7.4937529682E-7      7.4937446470E-7
    10      2.99500474671E-6     2.99500474671E-6     2.99500474671E-6     2.9950040243E-6
    7000    .50512196468         .50512196468         .50512196468         .50512194633
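The direct calculation in the table can be sketched as below. The failure rate is not stated for this case; a value of $10^{-4}$ per hour is assumed here because it is consistent with the tabulated values. The function name is hypothetical, and the formula is written in terms of the module failure probability $q = 1 - e^{-\lambda t}$, which is algebraically equal to $1 + 2e^{-3\lambda t} - 3e^{-2\lambda t}$ but avoids subtracting numbers close to unity.

```python
import math

def tmr_unreliability(lam, t):
    """P(SF) for TMR with no spares: two or more of three modules failed."""
    q = -math.expm1(-lam * t)          # module failure probability 1 - exp(-lam*t)
    return 3.0 * q**2 - 2.0 * q**3     # equals 1 + 2exp(-3*lam*t) - 3exp(-2*lam*t)

LAMBDA = 1.0e-4  # assumed failure rate per hour (not stated in the text)
for t in (0.01, 0.1, 1, 5, 10, 7000):
    print(t, tmr_unreliability(LAMBDA, t))
```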


4.4 M out of N 


The next system considered was an M out of N system; i.e., one in which failures in
M units out of N beginning units cause system failure. For this case, a seven out of
twelve system with perfect coverage was chosen. A failure rate of $10^{-4}$ per hour and a
mission time of eight thousand hours were assumed. Using the standard combinatorial
solution,

\[ P(SF) = \sum_{k=7}^{12} \binom{12}{k} p^k q^{12-k}, \]

where

\[ p = 1 - e^{-\lambda t} \text{ and } q = e^{-\lambda t}, \]

P(SF) of .5288303411826796 at t=8000 was expected.

In the initial attempts to solve this system as a Type 1, it was discovered that
ARIES was not correctly computing systems with more than one degradation and no
spares: the system would maintain perfect reliability for more than ten thousand hours
before a sudden decrease of several orders of magnitude. In addition, the Type 1 results
did not agree with the Type 7 results. Therefore, another modification was made to the
ARIES code to produce reasonable reliability computations for the Type 1 system. Also,
it was determined that an accuracy parameter had to be adjusted from its default value
for an accurate computation of the Type 7 system.

For this particular scenario, the ARIES Type 1 solution was .52883034118268013 at
t=8000; the Type 7 solution was .52883034118268137 at t=8000. Likewise, the CARE
III solution was .52883034118 at t=8000. A graph of the unreliability estimates from
CARE III, ARIES, and the direct calculation is included in Figure 4.1. This graph
illustrates the close agreement among the three solutions for the computed time range.
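The standard combinatorial solution for this case can be checked with a few lines of code. This is a sketch with a hypothetical function name; it sums the binomial probabilities of seven or more failures among twelve units.

```python
import math

def m_out_of_n_unreliability(m, n, lam, t):
    """Probability that at least m of n identical units have failed by time t."""
    p = -math.expm1(-lam * t)   # unit failure probability
    q = math.exp(-lam * t)      # unit survival probability
    return sum(math.comb(n, k) * p**k * q**(n - k) for k in range(m, n + 1))

# Seven out of twelve, failure rate 1e-4 per hour, mission time 8000 hours.
print(m_out_of_n_unreliability(7, 12, 1.0e-4, 8000.0))
```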

4.5 Quintuplex

For this system, a failure rate of $10^{-4}$ per hour, a mission time of 10 hours,
permanent faults, and imperfect coverage were assumed. System failure was defined to be
the occurrence of four or more faults or the occurrence of a sufficient number of faults
to preclude forming a majority from the remaining active processors. In defining the
single-fault model, it was assumed that (1) a fault is detected immediately as it produces
an error, (2) single point faults are excluded, and (3) only two concurrent active faults
can cause system failure. Thus, for the CARE III single-fault model shown in Figure
3.3, $P_A$, $P_B$, $\varepsilon$, $c$, $\delta$ and $\rho$ had to be selected consistent with these assumptions. It was
also necessary that the selected parameters would result in a double-fault model
consistent with these assumptions. Thus, the parameters were defined as follows:

    $P_A = P_B = 1$,
    $\varepsilon = 0$,
    $c = 1$,
    $\rho = 0$, and
    $\delta = 3600$/hour.

The resulting double-fault model is shown in Figure 3.6.


This system can be represented by the Markov model shown in Figure 4.2, where A
is a single fault and AA is a double fault. In this model the path 5 good - A - AA - SF
exposes the system to a triple fault during the recovery period, so that the probability of
loss of three out of five in time t is approximately $30\frac{\lambda^3}{\delta^2}t$. Since the time spent in
recovery is small relative to the failure rate, the fault recovery states can be
approximated by instantaneous coverage. With the instantaneous coverage approximations, the
model can be represented by the simplified model of Figure 4.3.

Since the probability of loss of three out of five in time t is approximately $30\frac{\lambda^3}{\delta^2}t$, triple
coincidences can be considered remote and can therefore be ignored. Thus, the dominant
path to SF due to lack of coverage is 5-4-SF. Using Laplace transforms,

\[ Lp[P_{cov}(SF)] \approx \frac{1}{s} \cdot \frac{5\lambda}{s+5\lambda} \cdot \frac{4\lambda(1-c_2)}{s+4\lambda}. \]

By partial fraction expansion and taking the inverse transform,

\[ P_{cov}(SF) \approx (1-c_2)\left[1 - 5e^{-4\lambda t} + 4e^{-5\lambda t}\right]. \]

The exhaustion of components is approximately the failure of four out of five
processors. Thus,

\[ P(SF) \approx (1-c_2)\left[1 - 5e^{-4\lambda t} + 4e^{-5\lambda t}\right] + 5(1-e^{-\lambda t})^4 e^{-\lambda t} + (1-e^{-\lambda t})^5. \]

For small $\lambda t$,

\[ P(SF) \approx \frac{30\lambda^3 t^2}{\delta} + 5(1-e^{-\lambda t})^4 e^{-\lambda t} + (1-e^{-\lambda t})^5, \]

so that at time t=10, $P(SF) = 5.81936 \times 10^{-12}$.
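The small-$\lambda t$ estimate can be evaluated as in the sketch below. The function name is hypothetical, and the coverage term $30\lambda^3 t^2/\delta$ is an assumption: it corresponds to taking $1 - c_2 \approx 3\lambda/\delta$ and keeping only the leading term of the bracketed expression, a choice that reproduces the value quoted in the text.

```python
import math

def quintuplex_unreliability(lam, delta, t):
    """Small lam*t estimate: coverage term plus loss of four or five of five units."""
    q = -math.expm1(-lam * t)                 # unit failure probability
    p_cov = 30.0 * lam**3 * t**2 / delta      # assumed leading-order coverage term
    p_exh = 5.0 * q**4 * (1.0 - q) + q**5     # exhaustion of components
    return p_cov + p_exh

# lam = 1e-4 per hour, delta = 3600 per hour, t = 10 hours
print(quintuplex_unreliability(1.0e-4, 3600.0, 10.0))
```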

For the CARE III analysis, this system was described as a one-stage system consisting
of five active modules and requiring a minimum of two fault-free modules for
continued operation. The system fault tree was described as consisting of one input and one
output, where the output, system failure, is contingent upon the failure of the stage. A
critical pair fault tree was also included specifying critical pairing between every two of
the five stage modules. Since CARE III does not allow for the triple fault, the $P_{cov}(SF)$
computed by CARE III is dominated by the Q(2) probability, i.e., the probability of
failure after two faults, and the P(SF) is therefore overestimated. However, taking the
Q(3) component of the $P_{cov}(SF)$ computed by CARE III, which corresponds to the 5-4-SF
coverage path of the model, and adding this to the $P^{*}$ ($P_{exh}(SF)$ assuming perfect
coverage) component computed by CARE III, yields

\[ P(SF) = 8.321588319 \times 10^{-13} + 4.9860181088 \times 10^{-12} = 5.8181769407 \times 10^{-12} \]

at time t = 10.

There is no transition in the ARIES model to correspond to the transition from 5
good to 3 good in this case. Thus, to construct an ARIES Type 1 description of this
case, the system was approximated by subsuming the 5 good to 3 good path into the 5
good to 4 good path. Using this approximation, the P(SF) was computed to be
$5.81801 \times 10^{-12}$ at time t = 10.

The full Markov model was initially used to construct an ARIES Type 7 system and
the P(SF) was computed to be $5.83694 \times 10^{-12}$ at time t=10. However, when this same
model was used with the states indexed so that the transition-rate matrix was upper
triangular rather than tridiagonal, the P(SF) was incorrectly computed. The results from
these two Type 7 descriptions are compared to the direct calculation in the following
table:

    t     Direct Calculation   ARIES Type 7        ARIES Type 7 Re-Indexed
    0     0                    0                   1.826535678 x 10^-8
    1     8.83 x 10^-15        1.83 x 10^-15       1.825623497 x 10^-8
    5     5.2039 x 10^-13      5.1459 x 10^-13     1.822027046 x 10^-8
    10    5.81936 x 10^-12     5.83694 x 10^-12    1.818007445 x 10^-8

Although the initial Type 7 estimate agrees fairly well with the direct calculation for
t=10 hours, the re-indexed model yields completely inaccurate estimates. This
demonstrates ARIES' sensitivity to the ordering of states.

Finally, the instantaneous coverage Markov model was used to construct an ARIES
Type 7 system. For this model, the P(SF) was computed to be $5.81907 \times 10^{-12}$ at time t =
10. The graph included in Figure 4.4 illustrates the close agreement among the
estimates from this Type 7 description, the Type 1, the CARE III, and the direct
calculation.

4.6 TMR with Powered Spares and Permanent Faults 

The fifth system considered was a TMR with two powered spares and permanent
faults. For this system, a failure rate of $10^{-4}$ per hour, a mission time of 10 hours, and
imperfect coverage were assumed. In defining the single-fault model, the parameters
were selected as before so that

    $P_A = P_B = 1$,
    $\varepsilon = 0$,
    $c = 1$,
    $\rho = 0$, and
    $\delta = 3600$/hour.

This system can be represented by the Markov model shown in Figure 4.5. Using
instantaneous coverage, the model can be represented by the simplified model shown in
Figure 4.6. Since the dominant path to SF due to lack of coverage is (3,2) - SF,

\[ P_{cov}(SF) \approx (1-c)(1 - e^{-3\lambda t}). \]

P(SF) due to exhaustion of components is the probability of loss of four out of five, so
that

\[ P_{exh}(SF) \approx 5(1 - e^{-\lambda t})^4 e^{-\lambda t} + (1 - e^{-\lambda t})^5. \]

Thus,

\[ P(SF) \approx \frac{6\lambda^2 t}{\delta} + 5(1 - e^{-\lambda t})^4 e^{-\lambda t} + (1 - e^{-\lambda t})^5. \]

Then for $\lambda = 10^{-4}$, $\delta = 3600$ per hour and t = 10 hours, $P(SF) = 1.7165269 \times 10^{-10}$.

Figure 4.5. Markov Model for Test Case 5

Figure 4.6. Instantaneous Coverage Model for Test Case 5 (exhaustion of components; lack of coverage with $1-c = \frac{2\lambda}{2\lambda+\delta} \approx \frac{2\lambda}{\delta}$)
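The direct calculation for Test Case 5 can be sketched as follows. The function name is hypothetical. The coverage term is taken as $(2\lambda/\delta)(3\lambda t)$, i.e. $1 - c \approx 2\lambda/\delta$ from Figure 4.6 combined with the small-$\lambda t$ form of $1 - e^{-3\lambda t}$; this is an assumption that reproduces the quoted value.

```python
import math

def tmr_powered_spares_unreliability(lam, delta, t):
    """Small lam*t estimate for TMR with two powered spares."""
    q = -math.expm1(-lam * t)                     # unit failure probability
    p_cov = (2.0 * lam / delta) * (3.0 * lam * t) # assumed: (1 - c) * 3*lam*t
    p_exh = 5.0 * q**4 * (1.0 - q) + q**5         # loss of four or five of five
    return p_cov + p_exh

# lam = 1e-4 per hour, delta = 3600 per hour, t = 10 hours
print(tmr_powered_spares_unreliability(1.0e-4, 3600.0, 10.0))
```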





For the CARE III analysis, this system was described as a one-stage system consisting
of five active modules and requiring a minimum of two fault-free modules for
continued operation. The configuration of this system into three active and two powered
spare units was specified by means of the NOP parameter. The system fault tree was
described as consisting of one input and one output, where the output, system failure, is
contingent upon the failure of the stage. A critical pair fault tree was also included
specifying critical pairing between every two of the five stage modules. With this system
description and assuming an active unit failure rate of $10^{-4}$ and a mission time of 10
hours, CARE III computed the P(SF) to be $1.7191249813 \times 10^{-10}$.

For the ARIES analysis, the system was described first as a Type 7 and then as a
Type 1. The instantaneous coverage Markov model was used to construct the transition
matrix for the Type 7 analysis and the P(SF) was computed to be $1.716543 \times 10^{-10}$ at time
t = 10. For the Type 1 analysis the system was described as starting with three active
units and two spares and able to sustain one degradation (or reconfiguration). The
active and spare failure rates were specified to be $10^{-4}$ per hour and the coverage
parameters for the possible system configurations were computed from the instantaneous
coverage Markov model. With this system description, the P(SF) was computed to be
$1.7165291 \times 10^{-10}$ at time t = 10.

A graph of the results is included in Figure 4.7a. This graph illustrates the close
agreement among the estimates from CARE III, ARIES, and the direct calculation.
Figure 4.7b shows the results obtained from an earlier version of CARE III, whose
estimates oscillate and are offset from the direct calculation.

4.7 TMR with Unpowered Spares


For the sixth system, a TMR with seven unpowered spares and permanent faults
was chosen. A failure rate of $10^{-4}$ per hour, a mission time of 10 years = 87,600
hours, and imperfect coverage were assumed. The single-fault model parameters are the
same as for Test Case 4.

Since the spares are unpowered, it was assumed that the failure rate for a spare is
zero until that spare is switched in to replace a failed active module. After the spare
becomes active, its failure rate is the same as that of an active module. This system is
represented by the Markov model shown in Figure 4.8. Approximating the fault
recovery states with instantaneous coverage as before, the model can be represented by
the simplified model in Figure 4.9.

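Using Laplace transforms, the P(SF) due to lack of coverage is

\[ Lp[P_{cov}(SF)] = \frac{3\lambda(1-c)}{s}\left[\frac{1}{s+3\lambda} + \frac{3\lambda c}{(s+3\lambda)^2} + \cdots + \frac{(3\lambda c)^7}{(s+3\lambda)^8}\right] \approx \frac{3\lambda(1-c)}{s(s+3\lambda)}, \]

so that

\[ P_{cov}(SF) \approx (1-c)\left[1 - e^{-3\lambda t}\right]. \]

Likewise, the P(SF) due to exhaustion of components is

\[ Lp[P_{exh}(SF)] = \frac{1}{s}\left(\frac{3\lambda}{s+3\lambda}\right)^{8} \frac{2\lambda}{s+2\lambda}. \]

By partial fraction expansion for repeated roots and $Lp^{-1}$,

\[ P_{exh}(SF) = 1 - 3^8 e^{-2\lambda t} + e^{-3\lambda t}\Big[ 6560 + 2186(3\lambda)t + 728(3\lambda)^2\frac{t^2}{2} + 242(3\lambda)^3\frac{t^3}{6} + 80(3\lambda)^4\frac{t^4}{24} + 26(3\lambda)^5\frac{t^5}{120} + 8(3\lambda)^6\frac{t^6}{720} + 2(3\lambda)^7\frac{t^7}{5040} \Big]. \]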
The repeated roots in the solution for exhaustion of components are a result of the
unpowered spare assumptions. In the expression for $P_{exh}(SF)$ derived by expanding
these roots, all terms up to the ninth power cancel. This cancellation causes computational
problems in the first one thousand hours. In the Markov model for this case, the
unpowered spare assumptions result in a transition-rate matrix with non-distinct
eigenvalues. Since the ARIES solution method is based on an assumption of distinct
eigenvalues, this case also causes computational problems for ARIES.
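
The cancellation in the closed-form exhaustion expression can be illustrated numerically. The following is a minimal sketch, not part of the original analysis. It evaluates the expression once in ordinary double precision and once with Python's `decimal` module at 60 digits. The function names and the 60-digit setting are assumptions made for illustration, and the polynomial coefficients are taken to be $3^{8-k} - 1$, which reproduces the values 6560, 2186, ..., 2 in the expression.

```python
import math
from decimal import Decimal, localcontext

# Coefficients of the bracketed polynomial: 3**(8 - k) - 1 for k = 0..7.
COEFFS = [3 ** (8 - k) - 1 for k in range(8)]   # [6560, 2186, 728, 242, 80, 26, 8, 2]


def p_exh_float(lam, t):
    """Closed-form exhaustion probability evaluated in double precision."""
    x = 3.0 * lam * t
    poly = sum(c * x ** k / math.factorial(k) for k, c in enumerate(COEFFS))
    return 1.0 - 6561.0 * math.exp(-2.0 * lam * t) + math.exp(-x) * poly


def p_exh_decimal(lam, t, digits=60):
    """Same expression evaluated with `digits` significant decimal digits."""
    with localcontext() as ctx:
        ctx.prec = digits
        lam_d, t_d = Decimal(str(lam)), Decimal(str(t))
        x = 3 * lam_d * t_d
        poly = sum(Decimal(c) * x ** k / math.factorial(k)
                   for k, c in enumerate(COEFFS))
        return 1 - 6561 * (-2 * lam_d * t_d).exp() + (-x).exp() * poly


if __name__ == "__main__":
    lam = 1e-4                       # failure rate per hour, as in this test case
    for t in (100.0, 1000.0, 10000.0, 87600.0):
        print(t, p_exh_float(lam, t), float(p_exh_decimal(lam, t)))
```

For small t the individual terms are of order 6561 while the true result is many orders of magnitude smaller, so the double-precision value is dominated by rounding error. The extended-precision value follows the expected ninth-power growth in t.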

In an ARIES Type 1 system, spares are assumed to be unpowered if the spare
failure rate, $\mu$, is less than the active module failure rate, $\lambda$. Because of the distinct
eigenvalue restriction, $\mu$ must be greater than zero and $\mu \ge \lambda/10^{6}$. For this test case,
the modified version of ARIES can compute reliability with p, = while the unmodi- 
fied version can compute reliability with p. = — ^ (but with computational errors for t 
< 10000 hours). 

Since ARIES will not allow $\mu$ to be zero, it overestimates the unreliability as
compared to the direct calculation. Also, making $\mu$ as small as ARIES would accept for this
case (to minimize the overestimation) resulted in computational errors for t < 10000.
The overestimation and the computational errors are illustrated in the graph of the
results included in Figure 4.10.
TMR with Unpowered Spares




In the Type 7 system, the transition-rate matrix is entered directly by the user and
any non-distinct eigenvalues are dropped from the computation. Thus, in this case with
eigenvalues of $-3\lambda$ (occurring 8 times) and $-2\lambda$, the duplicates are dropped so that the
system is solved with only two eigenvalues, $-3\lambda$ and $-2\lambda$. As a result, the solution for
this system is the same as that for a TMR with no spares, and the estimates of
unreliability cannot agree with those from the Type 1 and the direct calculation. The graph of
the Type 7 estimates is included in Figure 4.10 for comparison with the other graphs.

Since CARE III assumes that spares are powered, it was not possible to use CARE
III for this case.

4.8 AIPS-Like FCS

For the seventh system, a very simple AIPS-like FCS was chosen to highlight the
assumptions required to use ARIES and CARE III to estimate the reliability of an
AIPS-like architecture. The system shown in Figure 4.11 was assumed to consist of
eight sets of quad sensors, eight sets of quad actuators, and two triplex processors.
Failure rates of $10^{-4}$ per hour per sensor, $10^{-4}$ per hour per actuator, and $10^{-3}$ per hour
per processor; perfect coverage for the sensors and actuators, imperfect coverage for the
processors; permanent faults; and a mission time of 10 hours were assumed. The single
fault model parameters are the same as for Test Case 4. The system was to
operate as follows:

• After loss of the triplex processor set (three faults), its functions are performed by
the second triplex set, provided that it is still functional.

• The second triplex set was formerly performing non-critical functions and was not
vulnerable to critical fault pairs.

Figure 4.11. Fault Tolerant Flight Control System

System failure occurs if and only if 

• a sensor set is lost, 

• an actuator set is lost, or 

• the processing function is lost; i.e., two of the first triplex set are lost and two of the
second triplex set are lost.

In this system, the two triplex sets simulate FTMP and the reversion to the second
triplex set simulates functional migration. It was assumed that functional migration is
always successful. Point-to-point wiring, i.e., a 100% reliable network, was assumed.
Since triplex subsystems are considered, triple near-coincident faults are not a factor in
system failure. Also, sequence-dependent faults are not a factor in system failure
because of the reliability of the bus network.

The solution was obtained by decomposing the system into independent subsystems
so that

$$ P(SF) = P(E_S) + P(E_A) + P(E_{P_1} E_{P_2}) , $$

where

$E_S$ = Event of loss of a sensor set (1 of 8),

$E_A$ = Event of loss of an actuator set (1 of 8),

$E_{P_1}$ = Event of loss of primary processor set (1 of 1), and
$E_{P_2}$ = Event of loss of backup processor set (1 of 1).

Since a sensor set is lost when three out of four sensors in the set are lost, and there are
eight sets,

$$ P(E_S) = 32\lambda_S^3 t^3 . $$

Likewise,

$$ P(E_A) = 32\lambda_A^3 t^3 . $$

Since the processor sets are triplex, loss of two results in loss of the set. It is immaterial
whether two faults are nearly coincident in this scenario. Thus, the single fault model is
unnecessary and

$$ P(E_{P_1} E_{P_2}) = P(E_{P_1}) P(E_{P_2}) = (3\lambda_P^2 t^2)(3\lambda_P^2 t^2) = 9\lambda_P^4 t^4 . $$

Thus,

$$ P(SF) \approx 32\lambda_S^3 t^3 + 32\lambda_A^3 t^3 + 9\lambda_P^4 t^4 , $$

so that at 10 hours P(SF) = 1.54E-7.
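
The direct calculation above can be checked with a short script. This is a sketch under stated assumptions rather than the method used in the study. The function names are hypothetical, the first function reproduces the leading-order expression from the text, and the second keeps the same decomposition but evaluates each set-loss probability with the full binomial sum (still adding the eight sensor sets and eight actuator sets as rare events).

```python
import math


def at_least_k_failed(n, k, lam, t):
    """P(at least k of n independent modules with constant rate lam fail by t)."""
    p = -math.expm1(-lam * t)                    # single-module failure probability
    return sum(math.comb(n, m) * p ** m * (1.0 - p) ** (n - m)
               for m in range(k, n + 1))


def fcs_leading_terms(lam_s, lam_a, lam_p, t):
    """Leading-order expression: 32(lam_s t)^3 + 32(lam_a t)^3 + 9(lam_p t)^4."""
    return 32 * (lam_s * t) ** 3 + 32 * (lam_a * t) ** 3 + 9 * (lam_p * t) ** 4


def fcs_binomial(lam_s, lam_a, lam_p, t):
    """Same decomposition with binomial set-loss probabilities (illustrative)."""
    sensors = 8 * at_least_k_failed(4, 3, lam_s, t)      # eight quad sensor sets
    actuators = 8 * at_least_k_failed(4, 3, lam_a, t)    # eight quad actuator sets
    processors = at_least_k_failed(3, 2, lam_p, t) ** 2  # both triplex sets lost
    return sensors + actuators + processors


if __name__ == "__main__":
    args = (1e-4, 1e-4, 1e-3, 10.0)   # rates per hour and mission time in hours
    print(fcs_leading_terms(*args))
    print(fcs_binomial(*args))
```

The leading-order value corresponds to the 1.54E-7 figure quoted above. The binomial variant includes the higher-order terms that the leading-order expression omits, which is one plausible reason the tool results are slightly lower.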

For the CARE III analysis, the system was described as an 18-stage system
represented by the system fault tree in Figure 4.12. With this description, CARE III
computed the P(SF) at 10 hours to be 1.5090886052E-07.

For the ARIES analysis, this system had to be defined as a series configuration of
homogeneous subsystems. It was therefore necessary to combine the two processor
subsystems into one subsystem to accommodate their particular configuration; the Markov
model in Figure 4.13 describes the combined subsystem. Note that the Markov model
for the combined subsystem contains more states than the two separate subsystem
models would. A transition rate matrix for a Type 7 system was constructed from this
Markov model. The complete system was then solved by ARIES as a series configuration
of 16 Type 1 subsystems (sensors and actuators) and one Type 7 subsystem, resulting in a
P(SF) of 1.5090900654E-7 at 10 hours.

Figure 4.13. Markov Model for Processors of Test Case 7

A graph of the results is included in Figure 4.14. 
5.0 Assessments and Conclusions

5.1 Objective and Leading Particulars 

The objective of this section is to assess the potential benefits and limitations of 
CARE III and ARIES 82 when used for advanced fault tolerant systems applications. 
The benefits and limitations reported here were identified following a review of the
AIPS requirements, a study of the models upon which the tools were based, and
application of these tools to the test cases described in Section 4.0, as well as to other simple
systems.

It is expected that the results of this investigation will provide guidance for planning 
future reliability modeling research and development activities at NASA-LaRC. To this 
end, it is important to recognize and understand both the potential benefits and limita- 
tions of these tools. Understanding of these issues will help prevent misapplication of 
the tools. Limitations with respect to application of these tools to advanced systems 
could be eliminated by improvements or by the development of new tools. 

The observations and comments regarding these tools fall into three categories. 
The most important category includes issues which have a clear and direct impact on the 
capability to effectively represent advanced fault tolerant system configurations. 
Another category includes issues which are likely to impact application of these tools to 
advanced fault tolerant systems. Most items discussed will fall into these categories. 
Finally, the utility of automated tools often is limited by the demands placed upon the 
user. Consequently, a category for user-related issues is included. 



It should be noted that limitations have been identified for use of these tools to
model advanced architectures in a wide range of aerospace applications. In order to not
seem unduly negative, these results must be viewed within the context of the applications
for which these tools were originally developed. Further, the significance of a particular
limitation must be judged by the importance and scope of the advanced architectural
feature that creates the limitation.

5.2 CARE III Assessment

CARE III is the most recent in a series of reliability assessment tools developed by
NASA LaRC. It was designed primarily for analyzing ultrareliable flight control systems.
It is described as a general purpose reliability analysis and design tool for fault
tolerant systems and it is capable of handling large highly reliable systems. A fault
handling model is used to model detection, isolation, and recovery processes. CARE III
provides a variety of stationary and nonstationary fault and error models. These include
permanent, transient, intermittent, design, latent, and software faults or errors. CARE
III features a user-oriented fault tree language for describing complex system
configurations and success criteria. [11] [12]

A number of CARE III's characteristics would be useful for analysis of advanced
systems. The most important is the capability to analyze large systems. In the fault
occurrence model, CARE III can handle up to 70 stages as well as 2000 total events and
70 input events. Input events are the lowest level input to gates in the fault tree and
other events are inputs or outputs at higher levels in the fault tree. A stage may
comprise one or more modules. Each stage with replicated modules is treated as an M
out of N subsystem. When coupled with the options for multiple fault handling models,
very large and complex systems can be modeled. CARE III accomplishes a large state
reduction by decomposition and aggregation techniques. The fault occurrence and fault
handling parts of the model are decomposed under the assumption that there are several
orders of magnitude difference between the fault occurrence and the fault recovery
rates. This is referred to as a temporal decomposition. Further decomposition and
aggregation occurs when states across stages are aggregated based on the fault tree and
the critical pair tree. This is somewhat similar to some of the structural decomposition
and aggregation techniques used in other reliability tools. [3] [13]

The flexible fault handling/coverage/double fault features of CARE III distinguish it
from other reliability analysis tools. For applications where mission duration is short
relative to the time between failure occurrences, system failure due to failures in fault
handling or critically coupled double faults during recovery may be significant relative to
failure by exhaustion of components. In such applications, the capability to model the
fault handling and recovery processes should be important.

CARE III's capability to model nonconstant failure rates (Weibull distribution) also
distinguishes it from some other reliability modeling tools. This feature is useful for
systems that contain components subject to wear out, such as mechanical actuators and
some electronic components, and possibly for electronic systems subject to radiation
exposure. For applications where nonconstant failure rates and failures due to
component exhaustion are significant, CARE III's features could prove useful.

CARE III has undergone extensive testing and verification. The numerical accuracy
for extremely reliable and ultrareliable systems should be adequate.

In summary, the CARE III features that have potential value for reliability analysis
of advanced fault tolerant aerospace systems are the capability to handle large systems,
the somewhat flexible fault handling model, the capability to have nonconstant fault
occurrence rates, and the capability to model near coincident failures by the critically
coupled pair or double fault model.

A number of CARE III's limitations with respect to AIPS applications stem from
space missions of long duration. As noted earlier, CARE III was specifically designed to
evaluate reliability for air transport flight control systems. Mission durations are
considerably shorter in these applications. For long-duration space missions, emphasis will
shift from fault handling failures to exhaustion of components failures. The product of
mission duration and failure rate changes by several orders of magnitude. As a result,
approximations used in the CARE III computations may no longer be valid. Also, the
longer mission intervals and the need to conserve power or weight in space applications can
dictate the need to use unpowered spare modules. Presumably, these modules, while
unpowered, would have lower failure rates than their powered counterparts.

In Section 4.6 it was indicated that CARE III requires that spare modules have the
same failure rate as active modules. Finally, some space applications will operate as
open systems, i.e., maintenance will be permitted. CARE III only models closed
(maintenance free) systems.
The need to model sequence dependent failures sometimes arises when fault
tolerant systems are considered. One of the more important cases for AIPS stems from
the function migration concept. The concept can allow a function to be carried out by an
alternate processing site (system resource) when the processing site initially used for the
function fails. The capability to migrate the function could depend upon a fault-free
intercomputer network or a mass memory resource. If the failure of the initial
processing site occurs prior to the loss of the mass memory, the function can be migrated
successfully. If the mass memory failure precedes the loss of the initial processing site, the
function will be lost. In the function migration cases, reliability analysis may focus on
exhaustion of resources rather than coverage failures. CARE III can be used to bound
the effects of sequence-dependent failures. However, the sequence-dependent failure
modes introduced by function migration could require better capability in this area.

In Section 2.0 the need to analyze the reliability of large nodal communication
networks was identified. CARE III cannot model these networks. It should be noted that
tools to analyze reliability for these networks have not been developed.

Several potential limitations of the CARE III fault handling model have been
identified. These are

1. The fundamental assumption that sojourn times in the fault handling
model are small relative to the time between fault occurrences
may not be valid for latent faults or for some intermittent faults.

2. The fault handling models used are independent of system state.
For some systems it may be realistic to expect coverage to
deteriorate as system resources are reduced. CARE III can be used
to bound the reliability of such systems.

3. The fault handling model is constrained to a single entry state and to
identical transition rates ($\alpha$, $\beta$) between the active and benign states for
both faulted and error-producing states; transitions between some
states of the model are also omitted. These are flexibility issues of more
interest for research purposes.

4. The double fault model is conservative. A system failure results if
two critically coupled faults occur even though neither has produced
an error. This assumption could result in a too conservative prediction
when faults of long latency periods are present, e.g., software-dependent
hardware faults.

Multiple near coincident faults of order higher than two, i.e., multiple faults that occur
within the fault handling interval following the occurrence of the first fault, cannot be
modeled by CARE III. This case was demonstrated by the quintuplex example of
Section 4.0. As indicated, the reliability for this simple system could be obtained indirectly
from CARE III analysis of a TMR with two spares. The quintuplex configuration, an
important fault tolerant configuration, is not presently used in AIPS, but that is not to
say that critical triples will not arise in any AIPS applications, nor should one expect the
quintuplex to be absent from other advanced fault tolerant systems. Further, it should
not be inferred that the indirect method using CARE III will work satisfactorily for
more complex configurations or where other critical triples arise.

CARE III calculates reliability based on the assumption that the probability that
there are no failed modules in the system equals 1 at t=0. Perfect dispatch reliability
can be approached but not obtained for complex systems. For extremely reliable
systems, very high dispatch reliability is required. Consequently, the capability to set the
initial state occupancy probabilities to values other than “perfect dispatch” is highly
desirable.
During this investigation, several individuals of differing backgrounds have learned
to use CARE III. The earliest user learned at a time when CARE III was being
validated and modified and at a time prior to the publication of the user’s guides. During
this period, results were sometimes suspect due to the status of CARE III modifications.
Presently, users are confident of CARE III results. The new user’s guides have speeded
the learning process and represent a quantum improvement in the documentation.

5.3 ARIES 82 Assessment 

ARIES 82 is an interactive, unified reliability modeling tool developed by Ng and 
Avizienis at UCLA. It models systems which are composed of a series of independent 
homogeneous subsystems each of which can be modeled as a finite-state, continuous 
parameter, time-homogeneous Markov process. Limited state aggregation is achieved by 
analyzing the independent subsystems and combining the results. Fault handling is
assumed to be instantaneous and its effects are captured by constant coverage
probabilities which depend upon system state. As indicated in Section 3.0, ARIES 82 can be
applied to a wide range of system scenarios. 

The features of ARIES 82 which are of potential benefit to advanced fault tolerant 
system studies are 

1. The capability to model closed or open systems. 

2. Spare modules can have failure rates that are different than active 
module failure rates. 

3. A state transition matrix can be used to describe a system. 

4. An interactive user interface. 





Some of the limitations of ARIES 82 are 


1. Instantaneous coverage may not be adequate for modeling some
systems. When fault handling times are small relative to the time
between fault occurrences, this simple model is often adequate.

2. Constant failure rates are not adequate for modeling certain
components of aerospace systems.

3. System sizes are limited to relatively small systems.

4. The accuracy of the results is suspect for highly reliable systems.
Accuracy limitations are noted several times in Section 4.0.

5. The eigenvalues of the state transition matrix must be distinct.
Repeated eigenvalues can occur, for example, when spare failure rates
are zero until they are activated.

The accuracy limitations are restrictive. ARIES 82, in contrast to CARE III,
computes reliability instead of unreliability. For very reliable systems, this approach stresses
the numerical accuracy of the host computer and is the source of some of the accuracy
problems. Also, ARIES 82 normally reports only 7-digit results. The sensitivity of the
results to the order in which system states are indexed was noted in Section 4.0. Also,
nearly distinct eigenvalues can lead to accuracy problems.

ARIES 82 has been in use as a tool to support university teaching and research.
But ARIES 82 has not undergone a rigorous validation process. Even for the relatively
few and simple cases run for this study, at least two programming errors that produced
erroneous results were found.

Learning to use ARIES 82 was judged to be somewhat simpler than CARE III.
This was due, in part, to the relative simplicity of the ARIES 82 model.
5.4 Conclusions


A number of useful features were recognized in ARIES 82. Accuracy limitations, 
lack of formal validation, the presence of programming errors, the lack of product sup- 
port, and the limitations on system size combine to make ARIES 82 unsuitable for 
modeling advanced fault-tolerant systems. 

CARE III was found to have features desirable for modeling advanced system
architectures. Among these were the capability of handling large systems, a somewhat
flexible fault handling model, nonconstant failure rates in the fault occurrence model,
the provision for near coincident double faults, the computational accuracy required for
analyzing ultrareliable systems, and a user interface which provides for simple and
flexible system definition.

A number of CARE III limitations were identified. Among the more important
system scenarios which were difficult to model or could not be modeled using CARE III
were

1. Systems with unpowered spares, 

2. Systems where equipment maintenance must be considered, 

3. Systems where failure depends on the sequence in which faults oc- 
curred, 

4. Systems where multiple faults greater than a double near coincident 
fault must be considered, 

5. Systems containing large nodal communications networks that have 
a significant impact on system reliability, and 

6. Systems where less than perfect dispatch reliability must be con- 
sidered. 
Subject to constraints cited in paragraph 5.2 and repeated below, CARE III is best
suited for evaluating the reliability of advanced fault tolerant systems for air transport.
Characteristics of systems for which CARE III is best suited are


1. The mission time is short relative to the time between failure oc- 
currences. That is, coverage failures dominate exhaustion of com- 
ponent failures. 

2. The fault recovery time is short relative to the time between failure 
occurrences. 

3. Either the network reliability cannot impact system reliability or the 
network can be treated as an independent subsystem whose reliabili- 
ty can be determined by other means. 

4. Near coincident multiple faults of order greater than two are not 
relevant. 

5. System reliability should be in the extremely to ultrareliable regime. 
APPENDIX



FINAL REPORT 

NASA CONTRACT # NAS1-16489 
TASK 16 


COMPARATIVE ANALYSIS OF CARE III
AND ARIES 82 FOR RELIABILITY 
ANALYSIS OF AIPS ARCHITECTURE 


March 15, 1985 


Robert Baker 
Charlotte Scheper 


Research Triangle Institute 
Research Triangle Park, NC 27709 


SCOPE OF WORK 


Learn AIPS and Determine Suitability of
CARE III and ARIES for AIPS Analysis

Compare CARE III and ARIES

Apply CARE III and ARIES to "AIPS Like"
Architectures

Identify Limitations and Recommend
Refinements


Document 



AIPS OBJECTIVES 


Design a fault and damage tolerant system 
architecture which satisfies real-time data pro- 
cessing requirements for aerospace applications 


Develop support methods for design, evalua- 
tion, and verification 
AIPS APPLICATION REQUIREMENTS

                              Mission               Thruput     Memory                I/O Rates

COMMERCIAL AIRCRAFT           10 hrs     10^-9      5.5 MIPS    2 MB                  750 Kb/s

TACTICAL MILITARY AIRCRAFT    4 hrs      10^-7      6 MIPS      1 MB                  1 Mb/s

UNMANNED SPACE PLATFORM       5 yrs      10^-2      2 MIPS      750 KB                150 Kb/s

UNMANNED SPACE VEHICLE        1 wk       10^-6      .5 MIPS     300 KB                1.5 Mb/s

DEEP SPACE PROBE              5 yrs      10^-2      .5 MIPS     300 KB      1 MB

MANNED SPACE PLATFORM         20 yrs     10^-2      15 MIPS     20 MB                 15 Mb/s

MANNED SPACE VEHICLE          10 days    10^-7      1.5 MIPS    3 MB        3 MB      1 Mb/s

RATIO MAX/MIN                 40K        10^7       30          60          1000      100
AIPS SYSTEM ATTRIBUTES
(QUALITATIVE REQUIREMENTS) 


• Growth and Change Tolerance 

• Accepts Technology Upgrades 

• Graceful Degradation 

• System Complexity Transparent to User 

• Graded Redundancy 

• Damage Tolerance 


AIPS BUILDING BLOCKS 


FTMP 

FTP 

Intercomputer Network (Fault and Damage Tolerant) 
I/O Network (Fault and Damage Tolerant) 

Fault Tolerant Mass Memory 

Fault Tolerant Power Distribution System 

Network Operating System 
SOME KEY FAULT TOLERANT FEATURES
OF AIPS ARCHITECTURE 

• FTMP and FTP Concepts 

• Hardware Redundancy 

• Redundant Elements in Tight Synchronism 

• Fault Detection and Masking Implemented
in Hardware

• Fault Isolation and Reconfiguration 
in Software 

• Layered Communications Network 

• Source Congruency 

• Function Migration 
SOME AIPS ARCHITECTURAL FEATURES, CONCEPTS,
AND APPLICATION REQUIREMENTS IMPACTING 
RELIABILITY ASSESSMENT 


• Function Migration 

• Partial Cross-Strapping 

• Large Networks 

• High Degree of Fault Tolerance 

• Both Long and Short Mission Times 
[Figure: Simple Network and Alternate Network connecting Sensors (S1, S2, S3),
Processors (P1, P2, P3), and Actuators (A1, A2, A3). Legible annotations: 12 Links,
9 Nodes; 3 Failures for L.O.S.; 2 Failures Loss Nodes; 15 Links, 10 Nodes;
3 Failures to Isolate Nodes; Failures for Loss of System.]

Network Reliability Does Not
Impact System Reliability
CARE III FEATURES AND ATTRIBUTES

• Designed for Ultrareliable Flight
Control System Analysis and Design

• Handles Large Systems

• Large Reduction of State Space Via Aggregation

• Fault Handling Model {Permanent, Intermittent, Transient}

• Exponential and Weibull Failure Rates

• Double Fault Model

• Analyze Closed Systems

• Fault Tree Input
[Figure: Maximum Number Of Faults A Stage Can Sustain And Still Be Operational]

AN ALTERNATE STATE REPRESENTATION OF A STAGE

[Figure: Fault Handling Model of CARE III]

[Figure: Double-Fault Model of CARE III]

[Figure: State Structure of a Stage As Represented by CARE III]

ARIES 82 FEATURES


Designed and Used for University Reliability Projects

Markov Model
    finite-state
    continuous-parameter
    time-homogeneous

State Aggregation: Limited Structural Decomposition

Instantaneous, State-Dependent Coverage

Constant Transition Rates

Transient and Permanent Faults

Spares
    powered
    unpowered
    blocked

Parametric Description for Six Basic Systems

Matrix Description for General Systems
ARIES SYSTEMS


Type 1 Closed FT System 

Type 2 Closed FT System with Transient Fault Recovery 

Type 3 Mission-Oriented Repairable System 

Type 4 Repairable System with Transient Fault Recovery 

Type 5 Repairable System with Restart 

Type 6 Periodically Renewed Closed FT System 

Type 7 State Transition Rate Matrix 

Types 1-6 are fixed models instantiated by user- 
specified parametric values. 

Type 7 accepts a user-defined transition-rate matrix 
describing the complete system. 
TYPE 1


Closed FT System with permanent faults. 

No external repair or renewal. 

System has spares and can degrade after spares are 
exhausted. 

Ability to degrade can be blocked by unrecoverable 
spare failures. 

Standby spares periodically tested. 

Spare selection is predetermined. 

A failed module is removed from system. 
TYPE 1 PARAMETERS

D            Number of Degradations

S            Number of Spares

CS           Spare Coverage

$\lambda$    Active module failure rate
$\mu$        Spare module failure rate

Y            Active resource vector

CY           Coverage vector

$Y = (A, A-1, \ldots, A-D, A-(D+1))$

$CY = (C_A, C_{A-1}, \ldots, C_{A-D}, C_{A-(D+1)})$

TYPE 1 PARAMETERS: RESTRICTIONS & ASSUMPTIONS


CS < 1                         => (A,S,D) states can be entered;
                                  models blocked spares

$\lambda$ and $\mu$            are constant

$\mu = \lambda$                => powered spares

$0 < \mu < \lambda$            => unpowered spares

$\lambda/\mu \le 10^{6}$       (as specified)

CY[0]                          is coverage for all transitions from
                               states possessing spares

CY[D+1] = 0                    when no safe shutdown state
                               is provided
TYPE 7 GENERAL MISSION-ORIENTED FT SYSTEMS


Any system that can be represented by a single state 
transition-rate matrix Q of the form 



where $q_{ij}$ is the transition rate from state i to state j, and
$q_{jj}$ is the rate from state j to state j

Matrix can be input to ARIES symbolically or with actual 
numerical values 




TYPE 7 (cont.) 


System states can be partitioned into 5 disjoint subsets 

Full Capacity (FC) 

Degraded Capacity (DG) 

FC with Blocked Spares (FCB) 

Safe Shutdown (SS) 

Crash Failure (CF) 
SOLUTION


Assumptions

System is finite-state, continuous-parameter Markov process.
Markov process is time-homogeneous.

Transition-rate matrix has distinct eigenvalues.

Solution

$Q = [q_{ij}]$
        Transition-Rate Matrix

$P(t) = (p_1(t), p_2(t), \ldots, p_n(t))$
        State Probabilities (n operational states)

$P'(t) = Q P(t)$
        System Equation

$P(t) = e^{Qt} P(0)$
        Solution

$P(t) = \sum_{i=1}^{n} e^{\sigma_i t} \left[ \prod_{j \ne i} \frac{Q - \sigma_j I}{\sigma_i - \sigma_j} \right] P(0)$
        ARIES Solution using Sylvester's formula

$R_k(t) = \sum_{i=1}^{n} p_i(t)$
        Reliability of kth subsystem

$R(t) = \prod_{k=1}^{m} R_k(t)$
        System Reliability (m subsystems)
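
The role of the distinct-eigenvalue assumption can be seen in a small implementation of Sylvester's formula. The sketch below is illustrative only and is not the ARIES code. It assumes a triangular transition-rate matrix over the operational states, so that the eigenvalues can be read from the diagonal, and it uses the column-vector convention $P'(t) = QP(t)$. All function names are hypothetical.

```python
import math


def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def sylvester_expm(q, t, rel_tol=1e-9):
    """e^{Qt} by Sylvester's formula for a triangular Q (assumption).

    Raises ValueError when two eigenvalues coincide, since the formula
    divides by (sigma_i - sigma_j).
    """
    n = len(q)
    sigma = [q[i][i] for i in range(n)]          # eigenvalues of a triangular Q
    for i in range(n):
        for j in range(i + 1, n):
            if math.isclose(sigma[i], sigma[j], rel_tol=rel_tol):
                raise ValueError("repeated eigenvalue: formula does not apply")
    result = [[0.0] * n for _ in range(n)]
    for i in range(n):
        term = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
        for j in range(n):
            if j == i:
                continue
            denom = sigma[i] - sigma[j]
            factor = [[(q[r][c] - (sigma[j] if r == c else 0.0)) / denom
                       for c in range(n)] for r in range(n)]
            term = mat_mul(term, factor)
        weight = math.exp(sigma[i] * t)
        for r in range(n):
            for c in range(n):
                result[r][c] += weight * term[r][c]
    return result


def reliability(q, p0, t):
    """Sum of operational-state probabilities P(t) = e^{Qt} P(0)."""
    e = sylvester_expm(q, t)
    n = len(q)
    return sum(sum(e[r][c] * p0[c] for c in range(n)) for r in range(n))


if __name__ == "__main__":
    lam = 1e-4
    tmr = [[-3 * lam, 0.0], [3 * lam, -2 * lam]]   # TMR with no spares
    print(1.0 - reliability(tmr, [1.0, 0.0], 7000.0))
```

With a repeated eigenvalue, as arises when an unpowered spare replaces a failed module and the total failure rate stays at $3\lambda$, the denominator $\sigma_i - \sigma_j$ vanishes and the sketch raises an error instead of returning a value.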




MOTIVATION AND CRITERIA FOR 
TEST CASES 


Should demonstrate how to use the tool 

Should demonstrate general applicability and 
limitations (not an exhaustive test of features) 

Should be solvable using standard techniques 





TEST CASES DO NOT DEMONSTRATE
FULL CAPABILITY OF CARE III OR ARIES


In Particular for CARE III, the Following Important
Features were not Highlighted

• The Impact of Aggregation of States in Handling Very 
Large Systems 

• Full Use of Fault Handling Model Features 

• Non-constant Failure Rates 
TEST CASES

Simplex Processor
TMR with No Spares
M out of N
M out of N with Triple Fault
TMR with Powered Spares
TMR with Unpowered Spares
AIPS-like FCS



TEST CASE 1


Description

A simplex processor with a constant failure
rate $\lambda$.

Purpose

A simple case to point out any computational,
as opposed to modelling, differences.

Solution

$P(SF) = 1 - e^{-\lambda t} \approx \lambda t$ for small $\lambda t$


$\lambda t$      Direct Calculation    ARIES*         CARE III

0                0                     0              0
1.59E-23         -2.78E-17             0              1.59E-23
1.30E-19         -2.78E-17             0              1.30E-19
1E-16            6.94E-17              6.94E-17       9.99E-17
5E-16            4.85E-16              4.85E-16       5.00E-16
1E-15            9.99E-15              9.99E-16       1.00E-15
1E-12            9.99E-13              9.99E-13       9.99E-13
1E-10            9.99E-11              9.99E-11       1.00E-10
1E-3             9.995E-4              9.995E-4       9.995E-4
1                6.32E-1               6.32E-1        6.32E-1

For $\lambda t > 10^{-15}$, all methods give the same answer.

CARE III answers are accurate for much smaller values of
$\lambda t$.

Increased accuracy of CARE III results from computing
unreliability directly.


*ARIES reports reliability. Unreliability was obtained by subtracting
ARIES reliability answers from 1.
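
The effect of forming reliability first and subtracting from 1 can be reproduced in a few lines. This is a sketch for illustration only. It uses Python's `math.expm1` as a stand-in for "computing unreliability directly" and does not describe how CARE III or ARIES is implemented. The function names are hypothetical.

```python
import math


def unreliability_by_subtraction(x):
    """1 - e^{-x}: reliability is formed first, then subtracted from 1."""
    return 1.0 - math.exp(-x)


def unreliability_direct(x):
    """-expm1(-x): evaluates 1 - e^{-x} without forming a value near 1."""
    return -math.expm1(-x)


if __name__ == "__main__":
    # x plays the role of the product (failure rate) * (time)
    for x in (1.59e-23, 1.30e-19, 1e-16, 1e-12, 1e-3, 1.0):
        print(f"{x:9.2e}  {unreliability_by_subtraction(x):.6e}  "
              f"{unreliability_direct(x):.6e}")
```

In double precision, the subtraction form returns zero once the product of failure rate and time falls well below the machine epsilon, while the direct form keeps full relative accuracy. The exact threshold depends on the host arithmetic, so the small-argument rows in the table above need not match a modern machine digit for digit.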
TEST CASE 2


Description 


A TMR system with no spares.


Constant failure rate $\lambda$.


Purpose


Another basic system.


Solution


$$P(SF) = 1 + 2e^{-3\lambda t} - 3e^{-2\lambda t}$$


Results 


t       Direct               ARIES 1              ARIES 7              CARE III

0       0                    0                    0                    0
1       2.999500043E-8       2.999500043E-8       2.999500043E-8       2.9994993156E-8
5       7.4937529682E-7      7.4937529682E-7      7.4937529682E-7      7.4937446470E-7
10      2.99500474671E-6     2.99500474671E-6     2.99500474671E-6     2.9950040243E-6
.01     2.99996139042E-12    2.99996E-12          2.99996E-12          2.9999939165E-12
.10     2.99995001063E-10    2.99995E-10          2.99995E-10          2.9999519535E-10
7000    .50512196468         .50512196468         .50512196468
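
The TMR solution can be evaluated as follows. This is a sketch under stated assumptions: the failure rate of $10^{-4}$ per hour is not given on the slide and is inferred here from the tabulated values, and the function name is illustrative. The factored form $3p^2 - 2p^3$ with $p = 1 - e^{-\lambda t}$ is algebraically equal to $1 + 2e^{-3\lambda t} - 3e^{-2\lambda t}$ but avoids subtracting nearly equal quantities for small t.

```python
import math

def tmr_unreliability(lam: float, t: float) -> float:
    """P(SF) for TMR without spares: at least two of the three modules have failed."""
    p = -math.expm1(-lam * t)        # single-module failure probability, 1 - e^{-lam t}
    return 3.0 * p**2 - 2.0 * p**3   # equals 1 + 2e^{-3 lam t} - 3e^{-2 lam t}

LAM = 1.0e-4  # per hour; ASSUMED, inferred from the tabulated results

for t in (0.01, 0.10, 1.0, 5.0, 10.0, 7000.0):
    print(f"t = {t:8.2f}   P(SF) = {tmr_unreliability(LAM, t):.11e}")
```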
TEST CASE 3


Description: M out of N


12 Processors
Perfect Coverage
SF iff 7 or More Faults

$\lambda = 10^{-4}$/hour/processor
t = 8000 hours

Purpose


A Basic M Out of N System


Solution


$$P(SF) = \sum_{k=7}^{12} \frac{12!}{k!\,(12-k)!}\; p^k q^{12-k}$$

where $p = 1 - e^{-\lambda t}$ and $q = e^{-\lambda t}$


Results


         Direct Calculation   ARIES Type 1         ARIES Type 7         CARE III

t=8000   .5288303411826796    .52883034118268013   .52883034118268137   .52883034118

The initial results for ARIES Type 1 were incorrect: perfect reliability was
maintained for more than 10,000 hours and then dropped several orders of
magnitude. It was determined that ARIES was not correctly computing systems
with more than one degradation and no spares. A modification was made to
produce correct results.
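
The direct calculation for this case is a binomial tail sum. A minimal sketch, with an illustrative function name and standard-library calls only:

```python
import math

def m_of_n_unreliability(n: int, k_fail: int, lam: float, t: float) -> float:
    """Probability that at least k_fail of n independent units have failed by time t."""
    p = -math.expm1(-lam * t)   # unit failure probability, 1 - e^{-lam t}
    q = math.exp(-lam * t)      # unit survival probability
    return sum(math.comb(n, k) * p**k * q**(n - k) for k in range(k_fail, n + 1))

# Test Case 3 parameters: 12 processors, system failure at 7 or more faults
print(m_of_n_unreliability(12, 7, 1.0e-4, 8000.0))
```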
TEST CASE 4


Description: M out of N
5 processors

$\lambda = 10^{-4}$/hour/processor
Permanent Faults 
SF iff 

• 4 or more faults 

• Faults preclude majority of good processors 
Single Fault Model 

• A fault is detected immediately as it produces 
an error 

• Single point faults are excluded 

• Only 2 concurrent active faults can cause SF 
Imperfect Coverage 

$P_A = P_B = 1$

$\epsilon = 0$
c = 1
$\rho = 0$

$\delta = 3600$/hour

Purpose 

An M out of N System with Triple Fault 



TEST CASE 4 (cont’d) 


Model 



Solution 

Since $\lambda/\delta$ is small, instantaneous coverage can be used to
simplify the model.

[Figure: simplified model with the Exhaustion of Components path]

Test Case 4 
(continued) 

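Dominant Term Due to Lack of Coverage

$$\mathcal{L}\left[P_{cov,5\text{-}4\text{-}F}(t)\right] = \frac{1}{s}\left[\frac{5\lambda}{s+5\lambda}\right]\left[\frac{4\lambda}{s+4\lambda}\right] C_1(1-C_2)$$

By Partial Fraction Expansion and $\mathcal{L}^{-1}$

$$P_{cov,5\text{-}4\text{-}F}(t) = C_1(1-C_2)\left[1 - 5e^{-4\lambda t} + 4e^{-5\lambda t}\right]$$

The Exhaustion of Components Is Approximately the Failure of 4
Out of 5.

Thus,

$$P_{SF}(t) \approx \frac{3\lambda}{\delta}\left[1 - 5e^{-4\lambda t} + 4e^{-5\lambda t}\right] + 5\left(1-e^{-\lambda t}\right)^4 e^{-\lambda t} + \left(1-e^{-\lambda t}\right)^5$$

For Small $\lambda t$

$$P_{SF}(t) \approx \frac{30\lambda^3 t^2}{\delta} + 5\left(1-e^{-\lambda t}\right)^4 e^{-\lambda t} + \left(1-e^{-\lambda t}\right)^5$$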
Test Case 4
(continued) 


Results 


t     Direct              ARIES 1             ARIES 7             CARE III

10    5.81936 x 10^-12    5.81801 x 10^-12    5.81907 x 10^-12    5.81935 x 10^-12

This case is not directly computable by CARE III due to the triple fault. Our result was
obtained by using CARE III to solve for the P(SF) due to loss of 4 out of 5 processors
and adding the hand-calculated P(SF) due to lack of coverage.

The ARIES Type 1 solution is an approximation made by including the path from 5 good 
to 3 good in the path from 5 good to 4 good. 
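
The hand calculation for this case can be sketched as follows. Two assumptions are made here and should be checked against the original slide: the lack-of-coverage coefficient is taken to be $3\lambda/\delta$, and the small-$\lambda t$ form replaces the bracket $1 - 5e^{-4\lambda t} + 4e^{-5\lambda t}$ by its leading series term $10(\lambda t)^2$. All names are illustrative.

```python
import math

def exhaustion_4_of_5(lam: float, t: float) -> float:
    """Probability that 4 or 5 of the 5 processors have failed by time t."""
    p = -math.expm1(-lam * t)
    return 5.0 * p**4 * (1.0 - p) + p**5

def coverage_term(lam: float, delta: float, t: float, small_lt: bool) -> float:
    """Lack-of-coverage term; the coefficient 3*lam/delta is an ASSUMPTION."""
    x = lam * t
    if small_lt:
        bracket = 10.0 * x**2   # leading term of the series expansion
    else:
        bracket = 1.0 - 5.0 * math.exp(-4.0 * x) + 4.0 * math.exp(-5.0 * x)
    return (3.0 * lam / delta) * bracket

LAM, DELTA, T = 1.0e-4, 3600.0, 10.0
total_full = coverage_term(LAM, DELTA, T, small_lt=False) + exhaustion_4_of_5(LAM, T)
total_small = coverage_term(LAM, DELTA, T, small_lt=True) + exhaustion_4_of_5(LAM, T)
print(f"full bracket:   {total_full:.5e}")
print(f"small lambda*t: {total_small:.5e}")
```

Under these assumptions the small-$\lambda t$ form agrees with the tabulated Direct value to about one part in a million, which suggests that the Direct column was computed from that form.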
TEST CASE 5


Description 

TMR with 2 Powered Spares and Permanent Faults 
$\lambda = 10^{-4}$/hour/processor
t = 10 hours
Imperfect Coverage

$\rho = 0$

c = 1
$\epsilon = 0$

$P_A = P_B = 1$
$\delta = 3600$/hour

Purpose

Easily Analyzed Using Instantaneous Coverage
Well Suited for CARE III and ARIES

Results

t     Direct                ARIES 1               ARIES 7              CARE III

10    1.7165269 x 10^-10    1.7165291 x 10^-10    1.716543 x 10^-10    1.7068601501 x 10^-10
TEST CASE 5


Solution 

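Since $\lambda/\delta$ is very small, the above model can be simplified by using
instantaneous coverage.

$$P_{SF}(t) \approx \underbrace{(1-c)\left[1 - e^{-3\lambda t}\right]}_{\text{Coverage}} + \underbrace{5\left(1-e^{-\lambda t}\right)^4 e^{-\lambda t} + \left(1-e^{-\lambda t}\right)^5}_{\text{Components (4 of 5)}}$$

$$P_{SF}(t) \approx \frac{6\lambda^2 t}{\delta} + 5\left(1-e^{-\lambda t}\right)^4 e^{-\lambda t} + \left(1-e^{-\lambda t}\right)^5$$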
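
A short numerical check of this solution is sketched below. It assumes that the lack-of-coverage term reduces to $6\lambda^2 t/\delta$, that is, $1 - c = 2\lambda/\delta$ combined with $1 - e^{-3\lambda t} \approx 3\lambda t$. That reading is an inference from the tabulated result, not something legible on the slide. The function name is illustrative.

```python
import math

def p_sf_tmr_two_powered_spares(lam: float, delta: float, t: float) -> tuple[float, float]:
    """Return (coverage term, exhaustion term) for TMR with 2 powered spares."""
    p = -math.expm1(-lam * t)
    coverage = 6.0 * lam**2 * t / delta          # ASSUMED form: (2 lam / delta) * (3 lam t)
    exhaustion = 5.0 * p**4 * (1.0 - p) + p**5   # 4 or 5 of the 5 modules failed
    return coverage, exhaustion

cov, exh = p_sf_tmr_two_powered_spares(1.0e-4, 3600.0, 10.0)
print(f"coverage   = {cov:.7e}")
print(f"exhaustion = {exh:.7e}")
print(f"P(SF)      = {cov + exh:.7e}")
```

At t = 10 hours the lack-of-coverage term is larger than the exhaustion term by more than an order of magnitude, so the result is sensitive to how each tool treats coverage.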
[Figure: TMR with Powered Spares, plotted against Time (hours)]


TEST CASE 6 


Description 

TMR with 7 Unpowered Spares and Permanent Faults 
$\lambda = 10^{-4}$/hour
t = 10 years = 87,600 hours
Imperfect Coverage

$\rho = 0$

c = 1
$\epsilon = 0$

$P_A = P_B = 1$
$\delta = 3600$/hour

Purpose 

Markov Model Has Multiple Eigenvalues 
Solution


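Lack of Coverage

$$\mathcal{L}\left[P_{cov}(t)\right] = \frac{3\lambda(1-c)}{s}\left[\frac{1}{s+3\lambda} + \frac{3\lambda c}{(s+3\lambda)^2} + \cdots + \frac{(3\lambda c)^7}{(s+3\lambda)^8}\right]$$

$$\mathcal{L}\left[P_{cov}(t)\right] \approx \frac{3\lambda(1-c)}{s(s+3\lambda)}$$

$$P_{cov}(t) \approx \frac{2\lambda}{\delta}\left[1 - e^{-3\lambda t}\right]$$

Exhaustion of Components

$$\mathcal{L}\left[P_{CE}(t)\right] = \frac{1}{s}\,\frac{2\lambda}{s+2\lambda}\left(\frac{3\lambda c}{s+3\lambda}\right)^{8}$$

By Partial Fraction Expansion for Repeated Roots and $\mathcal{L}^{-1}$

$$P_{CE}(t) \approx 1 - 3^8 e^{-2\lambda t} + e^{-3\lambda t}\Big[6560 + 2186(3\lambda)t + 728(3\lambda)^2\frac{t^2}{2!} + 242(3\lambda)^3\frac{t^3}{3!} + 80(3\lambda)^4\frac{t^4}{4!} + 26(3\lambda)^5\frac{t^5}{5!} + 8(3\lambda)^6\frac{t^6}{6!} + 2(3\lambda)^7\frac{t^7}{7!}\Big]$$

All Terms Up to the 9th Power Cancel

Source of Computational Problems in First 1000 Hours.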
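
The cancellation problem in the exhaustion-of-components expression can be illustrated numerically. The sketch below evaluates the same closed form twice: once in double precision and once with Python's `decimal` module at 50 significant digits. It takes c = 1, uses the coefficients $3^{8-k} - 1$ (6560, 2186, ..., 2) from the expansion, and compares against the leading series term $2 \cdot 3^8 (\lambda t)^9 / 9!$. The function names and the choice of 50 digits are assumptions made for illustration, not part of either tool.

```python
import math
from decimal import Decimal, getcontext

# Coefficients 3^(8-k) - 1 for k = 0..7: 6560, 2186, 728, 242, 80, 26, 8, 2
COEFFS = [3 ** (8 - k) - 1 for k in range(8)]

def p_ce_float(lam: float, t: float) -> float:
    """Closed form evaluated in double precision (suffers cancellation for small t)."""
    a = 3.0 * lam * t
    poly = sum(c * a**k / math.factorial(k) for k, c in enumerate(COEFFS))
    return 1.0 - 3**8 * math.exp(-2.0 * lam * t) + math.exp(-a) * poly

def p_ce_decimal(lam: str, t: str, digits: int = 50) -> Decimal:
    """Same closed form evaluated with `digits` significant decimal digits."""
    getcontext().prec = digits
    lt = Decimal(lam) * Decimal(t)
    a = 3 * lt
    poly = sum(Decimal(c) * a**k / math.factorial(k) for k, c in enumerate(COEFFS))
    return 1 - 3**8 * (-2 * lt).exp() + (-a).exp() * poly

def leading_term(lam: float, t: float) -> float:
    """First surviving term of the power series, proportional to t^9."""
    return 2 * 3**8 * (lam * t) ** 9 / math.factorial(9)

for hours in ("100", "1000", "87600"):
    f = p_ce_float(1.0e-4, float(hours))
    d = p_ce_decimal("1e-4", hours)
    print(f"t = {hours:>6} h   float = {f: .6e}   decimal = {float(d):.6e}   "
          f"t^9 term = {leading_term(1.0e-4, float(hours)):.6e}")
```

For short mission times the double-precision value is dominated by rounding error in terms of magnitude near $3^8$, while the high-precision value tracks the $t^9$ term. At 87,600 hours the two evaluations agree.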
Type 1

Will Not Accept $\mu = 0$
Should Accept $\mu = \lambda/10^6$

Unmodified Version Accepts $\mu$ = [illegible]
Modified Version Accepts $\mu$ = [illegible]

Type 7

Eigenvalues Are $-3\lambda$ (8 times) and $-2\lambda$

To Solve System, Duplicates Are Dropped So System Is
Solved with 2 Eigenvalues, $-3\lambda$ and $-2\lambda$
Solution for This System Is Same As For TMR With No Spares
[Figure: TMR with Unpowered Spares, plotted against Time (hours)]
TEST CASE 7


Purpose 

To Highlight Assumptions Required to Use ARIES
and CARE III for an AIPS-like Architecture

• The Two Triplex Sets Simulate FTMP

• The Reversion to the Second Triplex Set
Simulates Functional Migration (Assumed Perfect)

• Assumed Network Does Not Impact Reliability

• Triple, Near-Coincident Faults Are Not a Factor
in Loss of System. For This, Quintuplex Processors
Are Needed. Thus, CARE III Does Not Need to
Accommodate More than 2 Near-Coincident Faults.

• Sequence-Dependent Faults Are Not a Factor in
Loss of System Because of the Reliability of the
Bus Network



Test Case 7
(continued)


Description

• AIPS-Like FCS
  Quad Sensors:        $\lambda_S = 10^{-4}$/hour/sensor (8 sets)
  Quad Actuators:      $\lambda_A = 10^{-4}$/hour/actuator (8 sets)
  Triplex Processors:  $\lambda_{P1} = 10^{-3}$/hour/processor (1 set)
  Triplex Processors:  $\lambda_{P2} = 10^{-3}$/hour/processor (1 set)
  Permanent Faults
  Perfect Coverage
  t = 10 hours

• System Operation

  After loss of the first triplex processor set, its functions are performed
  by the second triplex set, provided that it is still functional.

  Second triplex set was formerly performing non-critical functions
  and was not vulnerable to critical fault pairs.

• LOC iff

  Loss of a Sensor Set (1 of 8)

  Loss of an Actuator Set (1 of 8)

  Loss of Processing Function:
    Loss of 2 of the First Triplex Set and
    Loss of 2 of the Second Triplex Set
Test Case 7
(continued) 

Solution 

Independent Subsystems Lead to
Structural Decomposition

$$P[SF] \approx P(E_S) + P(E_A) + P(E_{P1} E_{P2})$$

$E_S$ = Loss of Sensor Set (1 of 8)

$E_A$ = Loss of Actuator Set (1 of 8)

$E_{P1}$ = Loss of Primary Processor

$E_{P2}$ = Loss of Backup Processor

$$P[E_S] = 32\lambda_S^3 t^3$$

$$P[E_A] = 32\lambda_A^3 t^3$$

$$P[E_{P1} E_{P2}] = P[E_{P1}]\,P[E_{P2}] = 9\lambda_P^4 t^4$$

$$P[SF] \approx 32\lambda_S^3 t^3 + 32\lambda_A^3 t^3 + 9\lambda_P^4 t^4$$
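
The structural decomposition can be checked numerically. The sketch below evaluates the leading-term expression and then a version that keeps the full binomial terms. It assumes that a quad set is lost when 3 of its 4 units have failed and a triplex set when 2 of its 3 processors have failed, which is what the factors 32 = 8 x 4 and 9 = 3 x 3 imply. All names are illustrative.

```python
import math

def fail_at_least(n: int, k: int, lam: float, t: float) -> float:
    """Probability that at least k of n independent units have failed by time t."""
    p = -math.expm1(-lam * t)
    q = math.exp(-lam * t)
    return sum(math.comb(n, j) * p**j * q**(n - j) for j in range(k, n + 1))

LAM_S = LAM_A = 1.0e-4   # per hour, sensors and actuators
LAM_P = 1.0e-3           # per hour, processors
T, SETS = 10.0, 8

# Leading-term approximation, as in the hand solution
approx = 32 * (LAM_S * T) ** 3 + 32 * (LAM_A * T) ** 3 + 9 * (LAM_P * T) ** 4

# Full binomial terms; ASSUMES quad set lost at 3 of 4, triplex set lost at 2 of 3
p_sensors = 1.0 - (1.0 - fail_at_least(4, 3, LAM_S, T)) ** SETS
p_actuators = 1.0 - (1.0 - fail_at_least(4, 3, LAM_A, T)) ** SETS
p_processing = fail_at_least(3, 2, LAM_P, T) ** 2   # both triplex sets lost
full = 1.0 - (1.0 - p_sensors) * (1.0 - p_actuators) * (1.0 - p_processing)

print(f"leading terms: {approx:.4e}")
print(f"full binomial: {full:.10e}")
```

The gap between the two values comes from truncating each series at its first term, most visibly in the processor term where $\lambda_P t = 10^{-2}$.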
[Figure: FAULT TREE, with branches for quad sensors, triplex processors, and quad actuators]





TEST CASE 7 


Model (Processor sets) 
TEST CASE 7
(continued) 


Results 

t     Direct           ARIES                    CARE III

10    1.54 x 10^-7     1.5090900654 x 10^-7     1.5090886052 x 10^-7

Each Subsystem Fits ARIES Type 1 Model
Series Configuration of Subsystems Assumed
Processor Subsystems Are Not Configured Serially; They Can Be
Combined Into One Type 7 Subsystem

CARE III Fault Tree Allows More Flexibility in Configuring
System
USER-RELATED


Advantages 

Interactive. 

Can save and reload a system. 

Help facility. 

Output in plottable format. 

Can create log file.

Can accept input from a command file.

Can compute other performance measures:

Mean time to first failure. 

System failure rate. 

Normalized probability of failure. 
Reliability improvement factor 
(one system vs. another). 

Mission time improvement factor 
(one system vs. another). 

Life-Cycle measures 
(for single subsystem).

Disadvantages 

Cannot modify a system and reload it. 

Cannot exit from define command prompts. 
Necessary information scattered throughout 
user’s guide. 

No support. 


CARE III USER-RELATED


Advantages 

System Fault Tree Input 
Easily Modified Input Files 
Output Provides Feedback 
Output Options 
Limited Plotting Capability 

Disadvantages 

Not Fully Interactive 


OBSERVATIONS REGARDING CARE III


• Does Not Handle Near Coincident Faults for N>2 

• Does Not Handle Sequence Dependent Faults 

• Double Fault Model Is Conservative 

• Designed for Ultrareliable Regime 

• Spares Must be Powered 

• Fault Handling Model, While Somewhat Flexible,
Is Restrictive in Some Respects, e.g., Only
One Entry Point, Identical Transition Rates
on Intermittent States, Does Not Depend on
System State, Fault Handling Time Assumed Short
Relative to Failure Rate

• Closed System 

• Not Fully Interactive 
[Figure: Three Fault Model]

[Figure: System Model for Example 2 When 2 Active Faults = LOC]


CARE III LIMITATIONS FOR AIPS APPLICATIONS


• Evaluates Closed System 

(Some AIPS Applications Include Maintenance) 

• Sequence Dependent Faults Not Directly Evaluated 
(Function Migration, Partial Cross-Strapping) 

• Unpowered Spares Not Handled 
(Unpowered Spares A Must for Space) 

• AIPS Needs Tool to Evaluate Network Reliability 


CARE III FEATURES OF POTENTIAL VALUE TO AIPS

• Fault Handling Model 

• Handles Large Systems 

• Evaluation of Reliability in Ultrareliable 
Regime 

• Non Constant Failure Rates 

• Double Fault Model 
ARIES CONCLUSIONS


MODEL 

Advantages 

Flexible with respect to spares 

powered, unpowered, blocked 

Parametric instantiation of six 
predefined system models 

Accepts matrix description of systems 


Disadvantages 

Instantaneous coverage (computed externally 
by user) 

Constant transition rates 

Spares can be unpowered but must have a
nonzero failure rate no smaller than
$\lambda/10^6$ and sufficiently large
to ensure distinct eigenvalues


ACCURACY 


Only reports reliability to seven digits 
Unverified and unsupported 
Bugs 

Inaccurate results for Type 1 systems with
more than 1 degradation and no spares

Inaccurate copy of subsystems 

Various errors in interactive prompts 

Calculation of reliability as opposed to unreliability 
subjects it to computational stress 

An accuracy parameter has to be adjusted to get 
accurate results for type 7 systems 

Inaccurate results can occur when eigenvalues are 
not distinct 



CONCLUSION 


• Subject to Limitations Previously Stated,
CARE III Can Be Used to Assess Reliability
of AIPS-like Architectures.

• While ARIES Has a Number of Desirable Features,
Its Limited Accuracy and Its Status with Respect
to Validation Are Sufficient to Rule Out Its
Use for AIPS.